Big Data
Analysis Patterns
Atlanta Big Data User Group
8/15/2013
1
whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• banderson@maprtech.com
2
Announcements


• Next ATLHUG Meeting - Sept. 26
  – How Google Does Big Data
• Wednesday – MapR Data Warehouse Offload Roadshow
• MapR Upcoming Training
  – MapR M7 & HBase for Developers on August 27 in Campbell, CA
  – MapR M7 & HBase for Developers on Sept 17 in Reston, VA
  – MapR M5 for Administrators on Oct 3 in Campbell, CA
3
BIG DATA
4
5
Big Data is not new!
but the tools are.

6
The Good News in Big Data:

“Simple algorithms and lots of data
trump complex models”

Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems
7
The Challenge: So Many Solutions!
What solutions fit your business problem?
For example, do you need…



• Apache Hadoop?
• Apache Mahout?
• Storm?
• Apache Solr/Lucene?
• Apache HBase (or MapR M7)?
• Apache Drill (or Impala?)
• d3.js or Tableau?
• Node.js
• Titan?
8
Ask a Different Question
It may be more useful to better define the problem by asking some
of these questions:



• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
• How fast is data arriving? (bursts or continuously?)
• Are queries by sophisticated users?
• Are you looking for common patterns or outliers?
• How are your data sources structured?
9
Picking the Best Solution
Your responses to these questions can help you better:


• define the problem
• recognize the analysis pattern to which it belongs
• guide the choice of solutions to try

But first, here’s a quick review of a few of the technologies you
might choose, and then we will focus on three of the questions as a
part of the landscape.
10
Apache Solr/Lucene
Solr/Lucene is a powerful search engine used for flexible, heavily
indexed queries including data such as


• Full text
• Geographical data
• Statistically weighted data

Solr is a small data tool that has flourished in a big data world.

11
Apache Mahout
Mahout provides a library of scalable machine learning algorithms
useful for big data analysis based on Hadoop or other storage
systems.

Mahout algorithms are mainly used for

• Recommendation (collaborative filtering)
• Clustering
• Classification

Mahout can be used in conjunction with solutions such as Solr: you
might use Mahout to create a co-occurrence database that could
then be queried using a search tool such as Solr.

12
Apache Drill


• Google Dremel clone
• Pluggable Query Languages
  – Starts with ANSI SQL 2003
  – Hive, Pig, Cascading, MongoQL, …
• Pluggable Storage Backends
  – Hadoop, HBase
  – MongoDB (BSON)
  – RDBMS?
• Bypasses MapReduce

13
Storm


• Realtime Stream Computation Engine
• Horizontal Scalability
• Guaranteed Data Processing
• Fault Tolerance
• Higher level abstraction over:
  – Message Queues
  – Worker Logic
• “The Hadoop of Realtime”

14
Titan


• Distributed Graph Database
• Property Graph
• Pluggable Backend Storage
  – HBase or M7
  – Cassandra
  – Berkeley DB
• Search Integrated
  – Solr/Lucene
  – Elastic Search
• Faunus
  – Batch processing of large graphs
• Fulgora
  – Graph traversals on subset
  – In-memory
15
Using the Answers to Guide Your Choices
For simplicity, let’s focus on the first three questions:

• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?

16
Big Data Decision Tree
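[Figure: decision tree with leaves A, B, C, D, and E. Questions and branch labels shown:
 • How big is your data? (<10 GB, mid, >200 GB)
 • What size queries? (single element at a time, one pass over 100%, multiple passes over big chunks)
 • Response time? (< 100s, human scale; throughput, not response)
 Other labels on the figure: big storage, streaming.]
17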
Use Cases
• Company
• Data Shape
• Technique(s)
• Business Value
18
Business Value
19
Business Value
20
Telecommunications Giant

ETL Offload
21
Telecommunications






Data Shape
• Lots of Data
• Lots of Queries across Large Sets
• Throughput important

22
Telecommunications

Techniques
Analytics

ETL

23
Telecommunications

Techniques

+
ETL (Hadoop)

Analytics (Teradata)
24
Telecommunications

Business Value

25
Credit Card
Issuer

26
Credit Card
Issuer

Data Shape
• Customer Purchase History (big)
• Merchant Designations
• Merchant Special Offers
• Throughput important
• Recommendations
27
Credit Card
Issuer

Techniques
A Recommendation Engine with Mahout and Solr/Lucene

History matrix
One row per user
One column per thing
28
Credit Card
Issuer

Techniques
• Recommendation based on cooccurrence
• Cooccurrence gives item-item mapping
• One row and column per thing
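
To make the history and cooccurrence matrices concrete, here is a minimal Python sketch. It is not Mahout's implementation. The user names, items, and the `cooccurrence` and `recommend` helpers are hypothetical, and scoring by raw counts is an assumption made for brevity; a real job would likely weight or filter the counts.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical history matrix: one row per user, one entry per "thing".
HISTORY = {
    "alice": {"coffee", "bagel", "newspaper"},
    "bob": {"coffee", "bagel"},
    "carol": {"coffee", "newspaper"},
    "dave": {"bagel", "juice"},
}


def cooccurrence(history):
    """Count, for every ordered pair of distinct items, how many users have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user_items, counts, top_n=3):
    """Score items the user lacks by summed cooccurrence with items they have."""
    scores = defaultdict(int)
    for item in user_items:
        for other, n in counts[item].items():
            if other not in user_items:
                scores[other] += n
    # Sort by descending score, then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


counts = cooccurrence(HISTORY)
print(recommend({"bagel", "coffee"}, counts))  # ['newspaper', 'juice']
```

The cooccurrence table has one row and one column per item, so its size depends on the catalog rather than on the number of users.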
29
Credit Card
Issuer

Techniques
Cooccurrence matrix can also be
implemented as a search index
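
One way to read this slide, sketched in plain Python rather than with Solr itself: each item becomes a document whose indicator field lists the items it cooccurs with, and a user's history is used as the query. The `ITEM_DOCS` data and the `build_index` and `search` helpers are hypothetical. Ranking by the number of matched terms is an assumption; a search engine would apply its own relevance scoring.

```python
from collections import defaultdict

# Hypothetical item documents: each item lists the items it cooccurs with.
ITEM_DOCS = {
    "bagel": ["coffee", "juice", "newspaper"],
    "coffee": ["bagel", "newspaper"],
    "juice": ["bagel"],
    "newspaper": ["bagel", "coffee"],
}


def build_index(item_docs):
    """Inverted index: indicator term -> set of item documents containing it."""
    index = defaultdict(set)
    for item, indicators in item_docs.items():
        for term in indicators:
            index[term].add(item)
    return index


def search(index, user_history, top_n=3):
    """Use the user's history as the query; rank documents by matched terms."""
    hits = defaultdict(int)
    for term in user_history:
        for item in index.get(term, ()):
            if item not in user_history:
                hits[item] += 1
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


index = build_index(ITEM_DOCS)
print(search(index, {"coffee", "bagel"}))  # ['newspaper', 'juice']
```

In this framing a recommendation is an ordinary query, so it can be served by the same search tier that handles other lookups.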

30
Credit Card
Issuer

Techniques
[Diagram: indexing pipeline. Components shown: complete history, cooccurrence (Mahout), item metadata, Solr indexers (indexing), index shards.]

31

20 Hrs → 3 Hrs
Credit Card
Issuer

Techniques
[Diagram: query path. Components shown: user history, web tier, Solr (search), item metadata, index shards. Annotation: 8 Hrs → 3 Min.]

32
Credit Card
Issuer

Techniques
[Diagram. Components shown: Hadoop, purchase history, merchant information, merchant offers, recommendation engine results (Mahout), export (4 hrs), import (4 hrs), presentation data store (DB2), apps.]
33
Credit Card
Issuer

Techniques
[Diagram. Components shown: Hadoop, purchase history, merchant information, merchant offers, recommendation engine results (Mahout), index update (3 min), recommendation search index (Solr), apps.]

34
Credit Card
Issuer

Business Value

35
Waste & Recycling Leader

Idle Alerts
36
Data Shape
• Truck Geolocation Data
  – 20,000 trucks
  – 5 sec interval (arriving quickly)
• Landfill Geographic Boundaries
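
A quick back-of-the-envelope calculation shows why this data shape counts as "arriving quickly". The two constants come from the slide. The assumption that every truck reports around the clock is mine, so the daily figure is an upper bound.

```python
TRUCKS = 20_000        # from the slide
INTERVAL_SECONDS = 5   # from the slide: one reading per truck every 5 seconds

events_per_second = TRUCKS / INTERVAL_SECONDS
# Assumption: all trucks report 24 hours a day.
events_per_day = events_per_second * 24 * 60 * 60

print(f"{events_per_second:,.0f} events/sec")  # 4,000 events/sec
print(f"{events_per_day:,.0f} events/day")     # 345,600,000 events/day
```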


37
Techniques
[Diagram. Components shown: truck geolocation data; realtime stream computation (Storm) with immediate alerts; Hadoop storage; batch computation (MapReduce) with tax reduction reporting; shortest path graph algorithm (Titan) with route optimization.]
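
The deck does not show how the idle alerts are computed, so the following is only a sketch of the kind of per-truck state a stream processor could keep. It is plain Python, not the Storm API. The `Ping` type, the bounding-box landfill boundary, the 300-second threshold, and the rule that waiting inside a landfill does not raise an alert are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ping:
    truck_id: str
    timestamp: int  # seconds
    lat: float
    lon: float


# Hypothetical landfill boundary as a box: (min_lat, min_lon, max_lat, max_lon).
LANDFILLS = [(33.70, -84.45, 33.72, -84.42)]
IDLE_SECONDS = 300     # assumed alert threshold
MOVE_EPSILON = 0.0001  # assumed: smaller coordinate changes mean "not moving"


def in_landfill(lat, lon):
    return any(a <= lat <= c and b <= lon <= d for a, b, c, d in LANDFILLS)


def idle_alerts(pings):
    """Yield (truck_id, timestamp) once per idle episode outside a landfill."""
    state = {}  # truck_id -> (anchor ping, alert already sent?)
    for p in pings:
        anchor, alerted = state.get(p.truck_id, (None, False))
        moved = (
            anchor is None
            or abs(p.lat - anchor.lat) > MOVE_EPSILON
            or abs(p.lon - anchor.lon) > MOVE_EPSILON
        )
        if moved or in_landfill(p.lat, p.lon):
            state[p.truck_id] = (p, False)  # restart the idle clock
            continue
        if not alerted and p.timestamp - anchor.timestamp >= IDLE_SECONDS:
            state[p.truck_id] = (anchor, True)
            yield p.truck_id, p.timestamp
```

Because the state is keyed by truck, the work could be partitioned by `truck_id` across workers, which is the sort of horizontal scaling the Storm slide describes.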

38
Business Value

39
Beverage Company

Social Engagement Application

40
Data Shape

• Tweets, FB Messages
• Person, Activity links
• Graph Traversal


41
Consumer Activity Graph
[Figure: consumer activity graph. Nodes shown: Wal*Mart.com, Ebay, Shopping.com, Sam’s, Ebay Motors, Dollar General, StubHub, CVS, Toys R Us.]
42
Techniques
[Diagram. Components shown: social activity stream, property graph (Titan), key/value store (MapR M7), graph traversal (Faunus/Fulgora).]
43
Business Value

44
Fraud Detection
Data Lake
45
Data Sources



• Anti-Money Laundering
• Consumer Transactions

46
Techniques
Anti-Money Laundering
System

Consumer Transactions
System

47
Techniques
[Diagram. Components shown: AML, consumer transactions, data lake (Hadoop), suspicious events, analyst.]
• Latent Dirichlet Allocation
• Bayesian Learning Neural Network
• Peer Group Analysis
48
Business Value

49
Machine Learning
Search Relevance
DNA Matching
50
Data Sources

• Birth, Death, Census, Military, Immigration records
• Search Behavior Activity
• DNA SNP (snips)


51
Techniques
• Record Linking
• Search Relevance
• Clickstream Behavior
• Security Forensics
• DNA Matching


52
Business Value

53
Traffic Analytics
54
Data Sources


• Inrix Road Segment Data
  – Avg Speed / minute / segment
  – Reference Speeds
• Road Segment Geolocation Data
55
Techniques
• Bottleneck Detection Algorithm
• Time Offset Correlations
  – Alternate Routes
• Predictive Congestion Analysis
  – Growth & Term Assumptions
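
The slides name a bottleneck detection algorithm without describing it. As one plausible reading of the data sources (average speed per minute per segment, plus reference speeds), the sketch below flags a segment when its speed stays below a fraction of its reference speed for several consecutive minutes. The `bottlenecks` function, the 0.6 ratio, the three-minute window, and the segment names are all assumptions, not the presenter's method.

```python
def bottlenecks(speeds, reference, ratio=0.6, min_minutes=3):
    """Return (segment, longest slow run) for segments whose speed stayed
    below ratio * reference speed for at least min_minutes in a row.

    speeds:    segment id -> list of average speeds, one per minute
    reference: segment id -> reference speed for that segment
    """
    flagged = []
    for segment, series in speeds.items():
        limit = ratio * reference[segment]
        run = longest = 0
        for speed in series:
            run = run + 1 if speed < limit else 0
            longest = max(longest, run)
        if longest >= min_minutes:
            flagged.append((segment, longest))
    # Longest slow run first; segment id breaks ties.
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


# Hypothetical readings for three segments over five minutes.
SPEEDS = {
    "seg-A": [60, 30, 28, 25, 58],
    "seg-B": [55, 54, 30, 56, 57],
    "seg-C": [20, 22, 21, 19, 18],
}
REFERENCE = {"seg-A": 65, "seg-B": 60, "seg-C": 65}

print(bottlenecks(SPEEDS, REFERENCE))  # [('seg-C', 5), ('seg-A', 3)]
```

Requiring a run of consecutive slow minutes keeps a single noisy reading, such as the one on seg-B, from being flagged.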
56
57
58
Business Value

59
Similar Characteristics
• Lots of Data
• Structured, Semi-Structured, Unstructured
• Varied Systems Interoperating
  – Hadoop, Storm, Solr, MPP, Visualizations
• Increase Revenue
• Decrease Costs


60
Questions?

61


Big Data Analysis Patterns with Hadoop, Mahout and Solr