Using Hadoop for Cognitive Analytics
Pedro Desouza, Ph.D.
Associate Partner
Big Data & Analytics Center of Competence
IBM Global Business Services
June 29, 2016
© 2016 IBM Corporation
Global Business Services
Outline
Metro Pulse: Enhancing Decision Making Processes With Hyperlocal Data
Dashboards
Use Cases In Multiple Industries
Geographic Hierarchies, External Metrics, and Mapping Representation
Integration Of External and Customer-Specific Metrics
Solution Architecture
Technological Components
Micro Services for Data Ingestion and Curation
Improving Decision Making Accuracy by Combining Business
Metrics with Hyperlocal Data
Weather
Social Media Sentiment
Economics…
Events
Thousands of them together, in a
single repository
Other Points of Interest
Subway Stations
Demographics
Hyperlocal Data
Business decisions can be made on
the precise hyperlocal context of
each store
Store Context
Combining business metrics of each store
with hyperlocal data provides insights via
visual inspection and advanced analytics
Demand Forecast, Marketing
Campaign, Distribution Plan
and many other business
decisions are usually based on
aggregate levels of data that
don’t precisely consider the
context where the business
operates.
Stores in London
Improving Forecast Accuracy with External Data
Traditional Method: Neural Net, ARIMA…
Forecast based on Neural Network
with External Data:
23.9% better accuracy
Actuals of a retail store
Riemer, M., Vempaty, A., Calmon, F., Heath, F., Hull, R., and Khabiri, E., Correcting Forecast with Multifactor Neural Attention, Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. http://jmlr.org/proceedings/papers/v48/riemer16.pdf
IBM T. J. Watson Research Center
Same color on the map → similar context considering all external metrics
Retail Use Case: Identification of Low/High Performers
Groups of similar stores in locations with
similar hyperlocal contexts
Category: “All Products”, “Electronics”, or “Cosmetics”…
[Chart: Groups 1 to 4, each with a group baseline and a top performer]
Potential Revenue Increase, per group: Rev Inc G1 = ∆𝑖, Rev Inc G2 = ∆𝑖, Rev Inc G3 = ∆𝑖, Rev Inc G4 = ∆𝑖
Micro-Segmentation + External Metrics → Higher Accuracy for Root Cause Analysis and Revenue Increase
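The slide does not spell out how each ∆𝑖 is computed. One possible reading, sketched below in Python, is that ∆𝑖 is the shortfall of store i against a reference level within its group of similar stores, summed per group. The function name, the `reference` parameter, and the sample revenues are all assumptions made for illustration, not part of the original deck.

```python
from statistics import mean


def potential_revenue_increase(groups, reference="top"):
    """Sum, per group, the shortfall of each store against a reference level.

    groups: dict mapping group name -> dict of store name -> revenue.
    reference: "top" uses the group's top performer, "mean" the group average.
               Both are assumptions about what the slide's baseline means.
    """
    result = {}
    for group, stores in groups.items():
        revenues = list(stores.values())
        target = max(revenues) if reference == "top" else mean(revenues)
        # Delta_i is the shortfall of store i; stores at or above target add 0.
        result[group] = sum(max(target - r, 0) for r in revenues)
    return result


# Made-up revenues for two hypothetical groups of similar stores.
groups = {
    "Group 1": {"A": 100, "B": 80, "C": 60},
    "Group 2": {"D": 50, "E": 50},
}
increase_vs_top = potential_revenue_increase(groups, reference="top")
increase_vs_mean = potential_revenue_increase(groups, reference="mean")
```

Comparing stores only within a group with a similar hyperlocal context is what keeps the gap attributable to store performance rather than to location.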
Population Movement Analytics
Store in Dallas
Close, but few visits. Why?
[Map: percentage of visits by region: 15%, 20%, 7%, 9%, 12%, 5%, 18%]
Percentage of visits based on buyer’s home location, obtained via anonymous app use analysis.
Potential location for a new store.
Other Use Cases
Advertisement
• Population demographics
• Where people are and go
Market Campaign
• Interests of each region (% of visits)
• Population density
City Planning
• Traffic growth
• Precise route
• Emergency Services
Telecommunication Use Cases: Quality of Services (Tower Location)
Affluent Houses → Life Time Revenue (LTR)
[Maps: LTR (High, Medium, Low) and Congestion (High, Medium, Low)]
Intuition: New tower
Max LTR: Ideal position for a new tower
Congestion
Famous band free show, Saturday, 9-11 PM: tower will be over capacity
Schedule a mobile base antenna during the event
Use cases are countless…
Banking and Finance
1. Branch Segmentation / New Market Opportunities
2. Cash Demand Forecasting
3. Promotion Customization
4. Staffing Mix / Specialty Account Services
5. Customer Churn
6. ATM Kiosk-to-Location Ratio Optimization
Retail
1. Uncaptured Opportunity
2. Assortment Optimization
3. Out of Stock
4. Demand Forecasting
5. Dynamic Pricing
6. Promotion Effectiveness
Insurance
1. Risk Management and Pricing Optimization
2. Portfolio Suitability
3. Demand Forecasting
4. Staffing Mix / Specialty Account Services
5. Damage Forecasting
City Analytics Industry Use Cases
Consumer Packaged Goods
1. Product mix
2. Out of Stock
3. Visibility
4. Expansion Opportunity
5. Customer Churn
6. Promotion Effectiveness
Travel and Transportation
1. Booking Traffic Forecasting Based on POIs
2. Service Relative Pricing Model
3. Promotion Customization
4. Amenity Mix
5. Cancellation Forecasting
Telecommunications
1. Customer Churn
2. Package/Service Offering Optimization
3. Coverage Optimization
4. New Product Demand
5. Device Repair Services
6. Service Outage Forecasting
Geographic Hierarchy, External Metrics, and Polygons
[Map labels: Rockaways, Manhattan, Soho, Midtown, Brooklyn, Queens, Southern, Eastern, Central]
External Metric Data Point Domain defined by coordinates: Temperature at (x,y) is 72 F.
External Metric Data Point Domain defined by a node of the hierarchy: It’s raining in Queens. → It’s raining in all polygons under Queens.
Nodes
Level 0: New York
Level 1: Manhattan, Brooklyn, Queens
Level 2: Soho, Midtown, Rockaways, Central, Southern, Eastern
Most cities have files with the boundaries of sub-regions (e.g., Rockaways, Central) represented as polygons:
A region made of one polygon (Polygon 1): ((lat lon, lat lon, … , lat lon))
A region made of two polygons (Polygon 2, Polygon 3): ((lat lon, lat lon, … , lat lon), (lat lon, lat lon, … , lat lon))
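A metric attached to a hierarchy node ("It's raining in Queens") applies to every polygon under that node. A minimal Python sketch of that propagation follows. The `HIERARCHY` mapping, the parent/child assignments in it, and the function names are illustrative assumptions; the slide does not state which Level 2 regions sit under which Level 1 node.

```python
# Hypothetical hierarchy: node -> children. Nodes without children are
# treated as leaf polygons. The parent/child assignments are illustrative.
HIERARCHY = {
    "New York": ["Manhattan", "Brooklyn", "Queens"],
    "Manhattan": ["Soho", "Midtown"],
    "Queens": ["Rockaways"],
}


def leaf_polygons(node, hierarchy):
    """Return all leaf polygons under a node, depth first."""
    children = hierarchy.get(node, [])
    if not children:
        return [node]
    leaves = []
    for child in children:
        leaves.extend(leaf_polygons(child, hierarchy))
    return leaves


def propagate(node, metric, value, hierarchy):
    """Attach a node-level metric value to every polygon under the node."""
    return {leaf: {metric: value} for leaf in leaf_polygons(node, hierarchy)}


rain = propagate("Queens", "raining", True, HIERARCHY)
all_leaves = leaf_polygons("New York", HIERARCHY)
```

A metric defined by coordinates, such as a temperature reading at (x, y), would instead be assigned to the single polygon that contains the point.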
Associating External and Internal Contexts
External/Public Context (same for all customers):
External Metrics, Events, News…
Geographic Hierarchy
Polygons
Internal/Customer-Specific Context (easily replaced for any customer):
Prime Entities (Stores, Towers, ATM…)
Customer-Specific Metrics
Customer Hierarchies (Product, Sales…)
Coordinates of Prime Entities of any customer can instantly leverage the external context associated with polygons.
IBM Metro Pulse Solution
Fundamental Polygon Functions
1) point_in_polygon(“Point X”, “Polygon P”)
[Figure: polygons Pol 1 to Pol 4 with points A, B, C]
point_in_polygon(“A”, “Pol 2”) → 1
point_in_polygon(“B”, “Pol 3”) → 0
Data Quality: All Prime Entities and Points of Interest must belong to one and only one polygon in each geographic hierarchy.
2) polygons_intersection(“Polygon P”, “Polygon Q”)
[Figure: polygons Pol 1, Pol 2, Pol 3]
polygons_intersection(“Pol 1”, “Pol 2”) → 1
polygons_intersection(“Pol 1”, “Pol 3”) → 0
Data Quality: No two polygons under the same hierarchy can intersect at any point other than on the edges or vertices.
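The deck names `point_in_polygon` but does not show how it is implemented. The sketch below uses the standard ray-casting test as a stand-in and adds a hypothetical `check_unique_membership` helper for the "one and only one polygon" data quality rule. The coordinates are made up, and points that fall exactly on an edge or vertex are not handled specially here.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: 1 if the point is inside the polygon, else 0.

    polygon is a list of (x, y) vertices; the ring is closed implicitly.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return 1 if inside else 0


def check_unique_membership(entities, polygons):
    """Return the entities that are not inside exactly one polygon."""
    violations = {}
    for name, point in entities.items():
        hits = [p for p, ring in polygons.items() if point_in_polygon(point, ring)]
        if len(hits) != 1:
            violations[name] = hits
    return violations


# Two adjacent squares sharing an edge, and three made-up entities.
polygons = {
    "Pol 1": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "Pol 2": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
entities = {"A": (3, 1), "B": (1, 1), "C": (5, 5)}
violations = check_unique_membership(entities, polygons)
```

Here entity C lies outside every polygon, so it is reported as a data quality violation.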
External Data Normalization Via a Reference Polygon
[Figure: three panels, each showing Pol 1 to Pol 4: Reference Polygon; Metric 1: Original; Metric 1: Normalized]
“Metric 1” values are based on a set of polygons that don’t match the reference polygon.
Different types of metrics (e.g., count, temperature) require different types of aggregation methods.
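The deck does not give the aggregation formulas. A common approach, assumed here, is areal weighting: counts are split across reference polygons in proportion to the overlapping share of each source polygon, while measurements such as temperature are averaged with overlap areas as weights. The `normalize` function, the precomputed `overlap_area` table, and the numbers are hypothetical, and the sketch assumes every source polygon is fully covered by reference polygons.

```python
def normalize(metric_values, overlap_area, kind):
    """Re-express a metric on reference polygons using overlap areas.

    metric_values: source polygon -> value.
    overlap_area: (source polygon, reference polygon) -> overlapping area.
    kind: "count" splits each source value by its share of overlap (totals
          are preserved); anything else takes an area-weighted average.
    """
    source_area = {}
    ref_area = {}
    for (src, ref), area in overlap_area.items():
        source_area[src] = source_area.get(src, 0.0) + area
        ref_area[ref] = ref_area.get(ref, 0.0) + area

    result = {}
    for (src, ref), area in overlap_area.items():
        if kind == "count":
            weight = area / source_area[src]
        else:
            weight = area / ref_area[ref]
        result[ref] = result.get(ref, 0.0) + metric_values[src] * weight
    return result


# S1 straddles R1 and R2 equally; S2 lies entirely inside R2.
overlap = {("S1", "R1"): 1.0, ("S1", "R2"): 1.0, ("S2", "R2"): 2.0}
counts = normalize({"S1": 100, "S2": 50}, overlap, kind="count")
temps = normalize({"S1": 70.0, "S2": 76.0}, overlap, kind="average")
```

With these numbers the count total of 150 is preserved across R1 and R2, while the temperature in R2 lands between the two source readings.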
Metro Pulse High Level Architecture
External Data, Landing Zone, IBM Data Lake, …
Global Enriched City Repository (External Data From Cities All Over The World)
Geographic Boundaries, Polygons, and Hierarchies
Analytics Workbench Gold Copy
On the Cloud: Analytics Workbench for Customer A, Customer B, ... Customer F, with Customer A, Customer B, and Customer F Specific Data; Cities relevant to Customer F
On Premise: Analytics Workbench for Customer G, Customer J, ..., with Customer G and Customer J Specific Data
DaaS: Cities relevant to Customer Z, Customer L, Customer K, ... (customers interested in external data only)
Metro Pulse Architecture – Version: 2.1
GBS Data Lake
External Data by City: Weather, Twitter, Census, ...
Geographical Borders, Polygons, and Hierarchies
Landing Zone
Metro Pulse Global City Repository (Curated Data)
REST API
DaaS
Power Users
Metro Pulse Analytical Workbench Gold Copy (One Deployment per Customer)
Customer-Specific Data by City/Site: POS, ATM, Cell Towers, ...
Files, Tables
SFTP / Direct Connections
Ingestion Layer
Customer-Specific City Repository
Core Analytics: Size of Prize, Movement Analytics, News Analysis, ...
Modeling
Enhanced Forecast
Parameters Repository
Sandbox
Data Scientists
Performance Layer
Access Services: REST API, DaaS, Visualization
Business User
Power Users
Analytics Workbench Data Flow
Internal data stages: Raw Internal Data, Raw Internal Data, Clean Internal Data, Validated Internal Data, Tabular Internal Data, Derived Data, Consumable Data, Visualized Data
External data stages: Raw External Data, Raw External Data, Clean External Data, Validated External Data, Tabular External Data, Published Data, Cached Published Data
Integrated Data
Transfer: SFTP
New Core Analytics Sandbox: Data Samples, Results, Published in Production
New Analytics Sandbox: Data Samples, Results, Published in Production
Platforms: Customer’s Site, Data Lake, Staging Node, Hadoop Cluster (HDFS and HBASE), Spark, Cassandra, Redis, Node.js, D3…
At the Customer’s Site: User’s Additional Data, User’s Database
Micro services are reusable not only for other customers, but also for other solutions
Micro Services for Data Ingestion and Curation
Data Sources: RDBMS, Structured Files, Unstructured
Ingestion Engine: Get Data, Copy Data
Edge Node, Hadoop
Curation Engine: Prepare Raw Data, Curate Data, Transform / Enrich Data
Raw Data Store
Conformed / Polyglot Data Store
Analytic Persistence: Hadoop, HBASE, Cassandra, Redis…
Micro Services
1 Error & Exception Processing
2 Configuration set up
3 Audit, Balance, & Control
4 Transport Data from Source to Edge Node
5 Convert Data Formats
6 Copy/Move Data to Hadoop
7 Preprocessing Service
8 Technical Data Validation (TDQ)
9 Source Delta Processing
10 Persist Raw Data
11 Catalog Raw Data
12 Profile Data
13 Cross File Analysis
14 Causality Analysis
15 Target Load Service
16 Business Data Validation
17 Merge / Match
18 Manage Keys
19 Reference Data Lookup
20 Transform Data
21 Enrich Data
22 Archive
23 Purge
Service groups as numbered on the diagram: 1-3; 4-6; 7-10; 11-14; 15; 16-21; 22-23
Loading Geographic Hierarchy to HBASE
Each table (L0 to L3) has two column families: Data (Name, Desc, History, …) and Polygons (P1, P2, P3, … PN).

Table L0
Row | Name | Desc | History | …
London | London | London is… | Great Britain… | …
Paris | Paris | Paris is… | Continental Europe… | …

Table L1
Row | Name | Desc | History | …
London:Central | Central | … | … | …
London:North | North | … | … | …
Paris:Central | Central | … | … | …

Table L2
Row | Name | Desc | History | …
London:Central:Kensington | Kensington | … | … | …
London:Central:Buckingham | Buckingham | … | … | …

Table L3
Row | Name | Desc | History | …
London:Central:Kensington:Notting Barns | Notting Barns | … | … | …
… | … | … | … | …

Sample values shown in the Polygons column family (P1, P2, P3, … PN):
50484 51673 54735 53896
75736 78493 78303 79659
50484 51673 54735
50484 51673
50484
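The row keys on this slide concatenate the hierarchy path with ":" and each level goes to its own table (L0, L1, ...). A minimal Python sketch of that layout follows. It only builds a plain dictionary describing the write and does not call any HBase client. The helper names and the idea that the Polygons family holds one polygon identifier per column are assumptions made for illustration.

```python
def row_key(path):
    """Join a hierarchy path into a row key, e.g. London:Central:Kensington."""
    return ":".join(path)


def level_table(path):
    """Pick the table for a path: one element -> L0, two -> L1, and so on."""
    return f"L{len(path) - 1}"


def build_put(path, desc, history, polygon_ids):
    """Describe one row as family:qualifier -> value pairs (no client call)."""
    columns = {
        "Data:Name": path[-1],
        "Data:Desc": desc,
        "Data:History": history,
    }
    # Assumption: the Polygons family stores one polygon id per column P1..PN.
    for i, pid in enumerate(polygon_ids, start=1):
        columns[f"Polygons:P{i}"] = str(pid)
    return {"table": level_table(path), "row": row_key(path), "columns": columns}


put = build_put(["London", "Central", "Kensington"], "…", "…", [50484, 51673])
```

Because the key starts with the parent path, all rows under London:Central sort next to each other, which suits prefix scans over a sub-tree.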
Ingesting External Data via Flume
Internet
Metro Pulse Global Repository Flume Server
Flume Agent: Tweets
Flume Agent: Weather
Flume Agent: News
...
Global City Repository: Tweets, Weather, News
Metro Pulse Analytical Workbench Edge Node (one per customer; shown four times on the slide)
Flume Agent: Tweets
Flume Agent: Weather
Flume Agent: News
Hadoop Data Nodes: HDFS (Tweets, Weather, News, ...)
Easy to broadcast the same data to multiple customers. Easy to add new customers.
One data source per agent: easy to add new sources.
Agents can be optimally configured according to the characteristics of each data source.
Each agent writes to a different HDFS folder: no conflicts, good for parallel execution.
Each source is captured as an HBASE column family.
Performance Layer
Views (full set): V_Transaction, V_Level_Entity, V_Polygon_Entity, V_Size_of_Prize, ...
Views (cached subset): V_Level_Entity, V_Size_of_Prize
API → Cache Manager: Get_View(“XYZ”)
- If “XYZ” in Redis, return “XYZ”
- Else:
  - Get “XYZ” from Cassandra
  - Return “XYZ” to the API
  - Load “XYZ” to Redis
Eviction Policy: Least Recently Used
Sub-second latency and high throughput → Dashboards (small files)
High throughput for large files → DaaS
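The Get_View steps above follow the cache-aside pattern with least-recently-used eviction. The Python sketch below mimics that logic with an in-memory `OrderedDict` standing in for Redis and a plain dictionary standing in for Cassandra. The `CacheManager` class, its capacity parameter, and the sample data are assumptions, and no real Redis or Cassandra client is used. Unlike the slide, the sketch loads the cache just before returning, which is equivalent for a single-threaded caller.

```python
from collections import OrderedDict


class CacheManager:
    """Cache-aside lookup with LRU eviction (in-memory stand-ins only)."""

    def __init__(self, backing_store, capacity=2):
        self._store = backing_store   # stand-in for Cassandra
        self._cache = OrderedDict()   # stand-in for Redis
        self._capacity = capacity

    def get_view(self, name):
        if name in self._cache:
            # Cache hit: mark the view as most recently used.
            self._cache.move_to_end(name)
            return self._cache[name]
        # Cache miss: read from the backing store, then load the cache.
        view = self._store[name]
        self._cache[name] = view
        if len(self._cache) > self._capacity:
            # Evict the least recently used entry.
            self._cache.popitem(last=False)
        return view

    def cached_names(self):
        return list(self._cache)


store = {"V_Transaction": "t", "V_Level_Entity": "l", "V_Size_of_Prize": "s"}
manager = CacheManager(store, capacity=2)
manager.get_view("V_Level_Entity")
manager.get_view("V_Size_of_Prize")
manager.get_view("V_Level_Entity")        # hit, becomes most recently used
result = manager.get_view("V_Transaction")  # miss, evicts V_Size_of_Prize
```

In a deployment with Redis itself, eviction would normally be delegated to the server's own memory policy rather than handled in application code.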
Sample of Visualization Objects on D3.js

More Related Content

What's hot (20)

PPTX
Fighting Financial Crime with Artificial Intelligence
DataWorks Summit
 
PPTX
Pouring the Foundation: Data Management in the Energy Industry
DataWorks Summit
 
PPTX
Disrupting Insurance with Advanced Analytics The Next Generation Carrier
DataWorks Summit/Hadoop Summit
 
PDF
Hybrid Cloud Strategy for Big Data and Analytics
DataWorks Summit/Hadoop Summit
 
PPTX
Big Data Application Architectures - Fraud Detection
DataWorks Summit/Hadoop Summit
 
PDF
Data in Motion vs Data at Rest
Internap
 
PDF
Hadoop Summit Tokyo HDP Sandbox Workshop
DataWorks Summit/Hadoop Summit
 
PPTX
Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse
DataWorks Summit
 
PPTX
Top 5 Strategies for Retail Data Analytics
Hortonworks
 
PDF
Moustafa Soliman "HP Vertica- Solving Facebook Big Data challenges"
Dataconomy Media
 
PPTX
Data Aggregation, Curation and analytics for security and situational awareness
DataWorks Summit/Hadoop Summit
 
PDF
Oracle Stream Analytics - Developer Introduction
Jeffrey T. Pollock
 
PDF
Battling the disrupting Energy Markets utilizing PURE PLAY Cloud Computing
Edwin Poot
 
PPTX
MapR on Azure: Getting Value from Big Data in the Cloud -
MapR Technologies
 
PDF
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
Hortonworks
 
PDF
Eliminating the Challenges of Big Data Management Inside Hadoop
Hortonworks
 
PDF
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
PDF
Apache Hadoop Crash Course - HS16SJ
DataWorks Summit/Hadoop Summit
 
PPTX
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...
DataWorks Summit
 
PDF
SplunkSummit 2015 - Real World Big Data Architecture
Splunk
 
Fighting Financial Crime with Artificial Intelligence
DataWorks Summit
 
Pouring the Foundation: Data Management in the Energy Industry
DataWorks Summit
 
Disrupting Insurance with Advanced Analytics The Next Generation Carrier
DataWorks Summit/Hadoop Summit
 
Hybrid Cloud Strategy for Big Data and Analytics
DataWorks Summit/Hadoop Summit
 
Big Data Application Architectures - Fraud Detection
DataWorks Summit/Hadoop Summit
 
Data in Motion vs Data at Rest
Internap
 
Hadoop Summit Tokyo HDP Sandbox Workshop
DataWorks Summit/Hadoop Summit
 
Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse
DataWorks Summit
 
Top 5 Strategies for Retail Data Analytics
Hortonworks
 
Moustafa Soliman "HP Vertica- Solving Facebook Big Data challenges"
Dataconomy Media
 
Data Aggregation, Curation and analytics for security and situational awareness
DataWorks Summit/Hadoop Summit
 
Oracle Stream Analytics - Developer Introduction
Jeffrey T. Pollock
 
Battling the disrupting Energy Markets utilizing PURE PLAY Cloud Computing
Edwin Poot
 
MapR on Azure: Getting Value from Big Data in the Cloud -
MapR Technologies
 
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
Hortonworks
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Hortonworks
 
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
Apache Hadoop Crash Course - HS16SJ
DataWorks Summit/Hadoop Summit
 
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...
DataWorks Summit
 
SplunkSummit 2015 - Real World Big Data Architecture
Splunk
 

Similar to Using Hadoop for Cognitive Analytics (20)

PDF
Taming Big Data With Modern Software Architecture
Big Data User Group Karlsruhe/Stuttgart
 
PDF
Improve Store Expansion (Territory Management Featuring)
Esri España
 
PDF
Mobility and Business Intelligence : A marriage made in heaven
Jean-Michel Franco
 
PPT
How Retail Banks Use MongoDB
MongoDB
 
PDF
Hybrid Cloud Streaming and Modernising Payments at Lloyds Banking Group
HostedbyConfluent
 
PDF
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Technologies
 
PDF
IBM Z for the Digital Enterprise 2018 - Z Keynote
DevOps for Enterprise Systems
 
PDF
Building the Cognitive Era : Big Data Strategies
Kevin Sigliano
 
PPTX
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
DataBench
 
PPTX
Webinar: How Financial Services Organizations Use MongoDB
MongoDB
 
PDF
Analytics on z Systems Focus on Real Time - Hélène Lyon
NRB
 
PPTX
Netvibes for Financial Services
Netvibes
 
PPTX
Next-Gen уже здесь
CEE-SEC(R)
 
PDF
How Financial Services Organizations Use MongoDB
MongoDB
 
PDF
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
Databricks
 
PDF
Big Data : nouvelle donne et opportunités - par JM Lazard, EDHEC 95, CEO de O...
Christelle EDHEC
 
PDF
Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014
Big Data Spain
 
PDF
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
Connected Data World
 
PPTX
Overview of Next Generation IT trends
Yuvaraj Ilangovan
 
PPT
Babak sorkhpour seminar in 80 8-24
Babak Sorkhpour
 
Taming Big Data With Modern Software Architecture
Big Data User Group Karlsruhe/Stuttgart
 
Improve Store Expansion (Territory Management Featuring)
Esri España
 
Mobility and Business Intelligence : A marriage made in heaven
Jean-Michel Franco
 
How Retail Banks Use MongoDB
MongoDB
 
Hybrid Cloud Streaming and Modernising Payments at Lloyds Banking Group
HostedbyConfluent
 
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Technologies
 
IBM Z for the Digital Enterprise 2018 - Z Keynote
DevOps for Enterprise Systems
 
Building the Cognitive Era : Big Data Strategies
Kevin Sigliano
 
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
DataBench
 
Webinar: How Financial Services Organizations Use MongoDB
MongoDB
 
Analytics on z Systems Focus on Real Time - Hélène Lyon
NRB
 
Netvibes for Financial Services
Netvibes
 
Next-Gen уже здесь
CEE-SEC(R)
 
How Financial Services Organizations Use MongoDB
MongoDB
 
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
Databricks
 
Big Data : nouvelle donne et opportunités - par JM Lazard, EDHEC 95, CEO de O...
Christelle EDHEC
 
Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014
Big Data Spain
 
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
Connected Data World
 
Overview of Next Generation IT trends
Yuvaraj Ilangovan
 
Babak sorkhpour seminar in 80 8-24
Babak Sorkhpour
 
Ad

More from DataWorks Summit/Hadoop Summit (20)

PPT
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
PPT
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
PDF
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
PDF
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
PDF
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
PDF
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
PDF
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
PDF
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
PDF
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
PPTX
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
PPTX
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
PDF
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
PPTX
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
PPTX
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
PPTX
HBase in Practice
DataWorks Summit/Hadoop Summit
 
PDF
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
PPTX
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
PPTX
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
PPTX
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit/Hadoop Summit
 
PPTX
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
DataWorks Summit/Hadoop Summit
 
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
HBase in Practice
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit/Hadoop Summit
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
DataWorks Summit/Hadoop Summit
 
Ad

Recently uploaded (20)

PDF
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
PPTX
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
PDF
Python basic programing language for automation
DanialHabibi2
 
PDF
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
PDF
The Builder’s Playbook - 2025 State of AI Report.pdf
jeroen339954
 
PDF
July Patch Tuesday
Ivanti
 
PDF
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
PDF
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
PPTX
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
HubSpot Main Hub: A Unified Growth Platform
Jaswinder Singh
 
PDF
Complete JavaScript Notes: From Basics to Advanced Concepts.pdf
haydendavispro
 
PDF
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
PDF
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
PDF
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PDF
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
PDF
Blockchain Transactions Explained For Everyone
CIFDAQ
 
PDF
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
PDF
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
Python basic programing language for automation
DanialHabibi2
 
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
The Builder’s Playbook - 2025 State of AI Report.pdf
jeroen339954
 
July Patch Tuesday
Ivanti
 
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
HubSpot Main Hub: A Unified Growth Platform
Jaswinder Singh
 
Complete JavaScript Notes: From Basics to Advanced Concepts.pdf
haydendavispro
 
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
Blockchain Transactions Explained For Everyone
CIFDAQ
 
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 

Using Hadoop for Cognitive Analytics

  • 1. Using Hadoop for Cognitive Analytics Pedro Desouza, Ph.D. Associate Partner Big Data & Analytics Center of Competence IBM Global Business Services June 29, 2016
  • 2. © 2016 IBM Corporation Global Business Services Outline 2 P Metro Pulse: Enhancing Decision Making Processes With Hyperlocal Data DashboardsP Use Cases In Multiple IndustriesP Geographic Hierarchies, External Metrics, and Mapping RepresentationP Integration Of External and Customer-Specific MetricsP Solution ArchitectureP Technological ComponentsP Micro Services for Data Ingestion and CurationP
  • 3. © 2016 IBM Corporation Global Business Services Improving Decision Making Accuracy by Combining Business Metrics with Hyperlocal Data 3 Weather Social Media Sentiment Economics… Events Thousands of them together, on a single repository Other Points of Interests Subway Stations Demographics Hyperlocal Data Business decision can be made on precise hyperlocal context for each store Store Context Combiningbusinessmetricsofeachstore withhyperlocaldataprovidesinsightsvia visualinspectionandadvancedanalytics Demand Forecast, Marketing Campaign, Distribution Plan and many other business decisions are usually based on aggregate levels of data that don’t precisely consider the context where the business operates. Stores in London
  • 4. © 2016 IBM Corporation Global Business Services Improving Forecast Accuracy with External Data 4 Traditional Method: Neuron Net, ARIMA… Forecast based on Neural Network with External Data: 23.9% better accuracy Actuals of a retail store Riemer, M., Vempaty, A., Calmon, F., Heath, F., Hull, R., and Khabiri, E., Correcting Forecast with Multifactor Neural Attention, Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. http://jmlr.org/proceedings/papers/v48/riemer16.pdf T. J. Watson IBM Research Center
  • 5. © 2016 IBM Corporation Global Business Services Same color on the map  Similar context considering all external metrics 5 Retail Use Case: Identification of Low/High Performers Groups of similar stores in locations with similar hyperlocal contexts Category: “All Products’, “Electronics”, or “Cosmetics”… Top Performer Top Performer Top Performer Top Performer Group 1 Baseline Group 2 Baseline Group 3 Baseline Group 4 Baseline Potential Revenue Increase: Rev Inc G1= ∆𝑖 Rev Inc G2= ∆𝑖 Rev Inc G3 = ∆𝑖 Rev Inc G4 = ∆𝑖 Micro-Segmentation + External Metrics  Higher Accuracy for Root Cause Analysis and Revenue Increase
  • 6. © 2016 IBM Corporation Global Business Services Population Movement Analytics 6 Store in Dallas Close, but few visits. Why? 15%20% 7% 9% 12% 5% of visits Percentage of visits based on buyer’s Home Location, obtained via anonymous app use analysis. 18% Potential location for a new store. Advertisement • Population demographics • Where people are and go P Market Campaign • Interests of each region (% of visits) • Population density P Other Use Cases City Planning • Traffic growth • Precise route • Emergency Services P
  • 7. © 2016 IBM Corporation Global Business Services Telecommunication Use Cases: Quality of Services (Tower Location) 7 Affluent Houses  Life Time Revenue (LTR) High Low Medium Congestion High Low Medium Intuition: New tower Max LTR: Ideal position for a new tower Congestion Famous band free show, Saturday, 9-11PM: Tower will be over capacity Schedule a mobile base antenna during event
  • 8. © 2016 IBM Corporation Global Business Services Use cases are countless… Banking and Finance 1. Branch Segmentation / New Market Opportunities 2. Cash Demand Forecasting 3. Promotion Customization 4. Staffing Mix / Specialty Account Services 5. Customer Churn 6. ATM Kiosk-to-Location Ratio Optimization Retail 1. Uncaptured Opportunity 2. Assortment Optimization 3. Out of Stock 4. Demand Forecasting 5. Dynamic Pricing 6. Promotion Effectiveness Insurance 1. Risk Management and Pricing Optimization 2. Portfolio Suitability 3. Demand Forecasting 4. Staffing Mix / Specialty Account Services 5. Damage Forecasting City Analytics Industry Use Cases Consumer Packaged Goods 1. Product mix 2. Out of Stock 3. Visibility 4. Expansion Opportunity 5. Customer Churn 6. Promotion Effectiveness Travel and Transportation 1. Booking Traffic Forecasting Based on POIs 2. Service Relative Pricing Model 3. Promotion Customization 4. Amenity Mix 5. Cancellation Forecasting Telecommunications 1. Customer Churn 2. Package/Service Offering Optimization 3. Coverage Optimization 4. New Product Demand 5. Device Repair Services 6. Service Outage Forecasting 8
  • 9. © 2016 IBM Corporation Global Business Services Geographic Hierarchy, External Metrics, and Polygons 9 Rockaways Manhattan Soho Midtown Brooklyn Queens Southern Eastern Central External Metric Data Point Domain defined by coordinates: Temperature at (x,y) is 72 F. (x,y) External Metric Data Point Domain defined by a node of the hierarchy: It’s raining in Queens.  It’s raining in all polygons under Queens. Level 0 Level 1 Level 2 New York ManhattanBrooklyn Queens Soho Midtown RockawaysCentralSouthern Eastern Nodes ((lat lon, lat lon, … , lat lon)) ((lat lon, lat lon, … , lat lon), (lat lon, lat lon, … , lat lon)) Polygon 1 Polygon 2 Polygon 3 Rockaways: Central: Most cities have files with the boundaries of sub-regions represented as polygons:
  • 10. © 2016 IBM Corporation Global Business Services Associating External and Internal Contexts 10 External Metrics, Events, News… Geographic Hierarchy Polygons Prime Entities (Stores, Towers, ATM…) Customer- Specific Metrics Customer Hierarchies (Product, Sales…) External/Public Context Internal/Customer-Specific Context Coordinates of Prime Entities of any customer can instantly leverage the external context associated to polygons Easily replaced for any customerSame for all customers IBM Metro Pulse Solution
  • 11. © 2016 IBM Corporation Global Business Services Fundamental Polygon Functions 11 2) polygons_intersection(“Polygon P”, “Polygon Q”) 1polygons_intersection(“Pol 1”, “Pol 2”) 0polygons_intersection(“Pol 1”, “Pol 3”) Pol 1 Pol 2 Pol 3 Data Quality: No two polygons under the same hierarchy can intersect on any point other than on the edges or vertices. 1) point_in_polygon(“Point X”, “Polygon P”) Pol 1 Pol 2 Pol 3 Pol 4 A B C 1point_in_polygon(“A”, “Pol 2”) 0point_in_polygon(“B”, “Pol 3”) Data Quality: All Prime Entities and Points of Interest must belong to one and only one polygon in each geographic hierarchy.
  • 12. External Data Normalization Via a Reference Polygon. Three panels, each showing Pol 1 to Pol 4: Reference Polygon; Metric 1: Original; Metric 1: Normalized. “Metric 1” values are based on a set of polygons that don’t match the reference polygon. Different types of metrics (e.g., count, temperature) require different types of aggregation methods.
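One way to read "different aggregation methods" is area weighting, sketched below. The slide does not specify the methods, so both rules are assumptions: counts are apportioned by the share of each source polygon's area that falls in a reference polygon, and temperature-like metrics are averaged with overlap areas as weights. Overlap areas are taken as precomputed inputs, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

# (source polygon, reference polygon) -> overlap area, assumed precomputed.
Overlap = Dict[Tuple[str, str], float]


def normalize_count(values: Dict[str, float],
                    source_area: Dict[str, float],
                    overlap: Overlap) -> Dict[str, float]:
    """Additive metrics: split each source value in proportion to overlap area."""
    result: Dict[str, float] = defaultdict(float)
    for (src, ref), area in overlap.items():
        if area <= 0:
            continue
        result[ref] += values[src] * area / source_area[src]
    return dict(result)


def normalize_average(values: Dict[str, float], overlap: Overlap) -> Dict[str, float]:
    """Non-additive metrics (e.g., temperature): area-weighted mean per reference polygon."""
    weighted: Dict[str, float] = defaultdict(float)
    total: Dict[str, float] = defaultdict(float)
    for (src, ref), area in overlap.items():
        if area <= 0:
            continue
        weighted[ref] += values[src] * area
        total[ref] += area
    return {ref: weighted[ref] / total[ref] for ref in weighted}


if __name__ == "__main__":
    overlap = {("A", "R1"): 4.0, ("A", "R2"): 6.0, ("B", "R2"): 5.0}
    print(normalize_count({"A": 100, "B": 50}, {"A": 10.0, "B": 5.0}, overlap))
    print(normalize_average({"A": 70.0, "B": 80.0}, overlap))
```

The count rule preserves the overall total when the source polygons are fully covered by reference polygons, while the average rule keeps values within the range of the inputs.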
  • 13. Metro Pulse High Level Architecture. IBM Data Lake: External Data Landing Zone; Global Enriched City Repository (External Data From Cities All Over The World; Geographic Boundaries, Polygons, and Hierarchies); Analytics Workbench Gold Copy. On the Cloud: Analytics Workbench Customer A, Customer B, ... Customer F, with Customer A Specific Data, Customer B Specific Data, Customer F Specific Data, and the cities relevant to each customer (e.g., Cities relevant to Customer F). On Premise: Analytics Workbench Customer G, ... Customer J, with Customer G Specific Data and Customer J Specific Data. DaaS: Cities relevant to Customer K, Customer L, ... Customer Z (customers interested in external data only).
  • 14. Metro Pulse Architecture (Version: 2.1). GBS Data Lake: External Data by City (Weather, Twitter, Census, ...) and Geographical Borders, Polygons, and Hierarchies pass through the Landing Zone into the Metro Pulse Global City Repository (Curated Data), which is available through a REST API to Power Users and DaaS. Metro Pulse Analytical Workbench Gold Copy (One Deployment per Customer): Customer-Specific Data by City/Site (POS, ATM, Cell Towers, ...) arrives as Files and Tables via SFTP / Direct Connections through the Ingestion Layer. Components shown: Customer-Specific City Repository, Parameters Repository, Core Analytics, Size of Prize, Movement Analytics, News Analysis, ..., Modeling, Enhanced Forecast, Sandbox, Performance Layer. Access Services: REST API, DaaS, Visualization. Users: Business User, Power Users, Data Scientists.
  • 15. Data Lake Analytics Workbench Data Flow. Internal data stages: Raw Internal Data (Customer’s Site), SFTP, Raw Internal Data (Staging Node), Clean Internal Data, Validated Internal Data, Tabular Internal Data. External data stages: Raw External Data, Raw External Data, Clean External Data, Validated External Data, Tabular External Data. Downstream stages: Integrated Data, Derived Data, Consumable Data, Published Data, Cached Published Data, Visualized Data. Sandboxes: New Core Analytics Sandbox (Data Samples, Results, Published in Production); New Analytics Sandbox (Data Samples, Results, Published in Production). Other inputs: User’s Additional Data (Customer’s Site); User’s Database (Customer’s Site). Technologies shown: Hadoop Cluster (HDFS and HBASE), Spark, Cassandra, Redis, Node.js, D3… Micro services are reusable not only for other customers, but also for other solutions.
  • 16. Micro Services for Data Ingestion and Curation. Data Sources: RDBMS, Structured Files, Unstructured. Engines: Ingestion Engine (Hadoop Edge Node), Curation Engine. Stages: Get Data, Copy Data, Prepare Raw Data, Curate Data, Transform / Enrich Data. Stores: Raw Data Store, Conformed / Polyglot Data Store. Analytic Persistence: Hadoop, HBASE, Cassandra, Redis… Micro Services: 1 Error & Exception Processing; 2 Configuration Set Up; 3 Audit, Balance, & Control; 4 Transport Data from Source to Edge Node; 5 Convert Data Formats; 6 Copy/Move Data to Hadoop; 7 Preprocessing Service; 8 Technical Data Validation (TDQ); 9 Source Delta Processing; 10 Persist Raw Data; 11 Catalog Raw Data; 12 Profile Data; 13 Cross File Analysis; 14 Causality Analysis; 15 Target Load Service; 16 Business Data Validation; 17 Merge / Match; 18 Manage Keys; 19 Reference Data Lookup; 20 Transform Data; 21 Enrich Data; 22 Archive; 23 Purge.
  • 17. Loading Geographic Hierarchy to HBASE. One table per hierarchy level (Table L0, L1, L2, L3), each with Column Family: Data (Name, Desc, History, …) and Column Family: Polygons (P1, P2, P3, …, PN). Table L0 rows: London (Name: London; “London is…”; “Great Britain…”), Paris (Name: Paris; “Paris is…”; “Continental Europe…”). Table L1 rows: London:Central (Name: Central), London:North (Name: North), Paris:Central (Name: Central), … Table L2 rows: London:Central:Kensington (Name: Kensington), London:Central:Buckingham (Name: Buckingham), … Table L3 rows: London:Central:Kensington:Notting Barns (Name: Notting Barns), … Values shown in the Polygons cells: 50484 51673 54735 53896 75736 78493 78303 79659 50484 51673 54735 50484 51673 50484.
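The row-key scheme on this slide (the full path joined by colons, one table per level) can be sketched without an HBase client. This is an illustration only: `row_key`, `table_for`, and `build_put` are hypothetical helpers, and the `family:qualifier` column naming follows the usual HBase convention rather than anything stated on the slide.

```python
from typing import Dict, List, Sequence, Tuple


def row_key(path: Sequence[str]) -> str:
    """Join hierarchy node names into a row key such as 'London:Central:Kensington'."""
    return ":".join(path)


def table_for(path: Sequence[str]) -> str:
    """Table name by depth: a one-element path lives in L0, two elements in L1, etc."""
    return f"L{len(path) - 1}"


def build_put(path: Sequence[str], desc: str,
              polygon_ids: List[int]) -> Tuple[str, str, Dict[str, str]]:
    """Return (table, row key, cells) for one hierarchy node.

    Cells use the two column families from the slide: Data and Polygons.
    """
    cells = {"Data:Name": path[-1], "Data:Desc": desc}
    for i, polygon_id in enumerate(polygon_ids, start=1):
        cells[f"Polygons:P{i}"] = str(polygon_id)
    return table_for(path), row_key(path), cells


if __name__ == "__main__":
    print(build_put(["London", "Central", "Kensington"], "...", [50484, 51673]))
```

Because a parent's key is a prefix of each child's key, all nodes under a parent in a given level table can be fetched with a prefix scan on the parent key followed by the separator.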
  • 18. Ingesting External Data via Flume. Metro Pulse Global Repository: Flume agents on the Flume Server (Flume Agent: Tweets, Flume Agent: Weather, Flume Agent: News, ...) capture Tweets, Weather, News, ... from the Internet into the Global City Repository. Each Metro Pulse Analytical Workbench has an Edge Node with the same agents (Flume Agent: Tweets, Flume Agent: Weather, Flume Agent: News, ...) writing Tweets, Weather, News, ... to its Hadoop Data Nodes (HDFS). Easy to broadcast the same data to multiple customers. Easy to add new customers. Agents can be optimally configured according to the data sources’ characteristics. Each agent writes to a different HDFS folder: no conflict, good for parallel execution. Each source is captured as an HBASE column family. One data source per agent: easy to add new sources.
  • 19. Performance Layer. Views in Cassandra: V_Transaction, V_Level_Entity, V_Polygon_Entity, V_Size_of_Prize, ... Views cached in Redis: V_Level_Entity, V_Size_of_Prize. The API calls Get_View(“XYZ”) on the Cache Manager: if “XYZ” is in Redis, return “XYZ”; else get “XYZ” from Cassandra, return “XYZ” to the API, and load “XYZ” into Redis. Eviction Policy: Least Recently Used. Sub-second latency and high throughput for small files → Dashboards. High throughput for large files → DaaS.
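The Get_View logic on this slide is a cache-aside read with LRU eviction. The sketch below mirrors that logic with in-memory stand-ins: a plain dictionary plays the role of Cassandra and an `OrderedDict` plays the role of Redis. The class name, the `capacity` parameter, and the explicit eviction step are assumptions for illustration; a real Redis deployment would typically enforce eviction itself.

```python
from collections import OrderedDict
from typing import Any, Dict, List


class CacheManager:
    """Cache-aside lookup with Least Recently Used eviction (in-memory sketch)."""

    def __init__(self, backing_store: Dict[str, Any], capacity: int) -> None:
        self._store = backing_store           # stand-in for Cassandra
        self._cache: "OrderedDict[str, Any]" = OrderedDict()  # stand-in for Redis
        self._capacity = capacity

    def get_view(self, name: str) -> Any:
        # Cache hit: mark the view as most recently used and return it.
        if name in self._cache:
            self._cache.move_to_end(name)
            return self._cache[name]
        # Cache miss: read from the backing store, then load into the cache.
        value = self._store[name]
        self._cache[name] = value
        # Evict the least recently used view when over capacity.
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)
        return value

    def cached_views(self) -> List[str]:
        """View names currently cached, least recently used first."""
        return list(self._cache)


if __name__ == "__main__":
    store = {"V_Transaction": "t", "V_Level_Entity": "l", "V_Size_of_Prize": "s"}
    manager = CacheManager(store, capacity=2)
    for view in ["V_Level_Entity", "V_Size_of_Prize", "V_Level_Entity", "V_Transaction"]:
        manager.get_view(view)
    print(manager.cached_views())
```

Reading a cached view refreshes its position, so frequently requested dashboard views stay in the cache while rarely used ones are evicted first.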
  • 20. Sample of Visualization Objects on D3.js