Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics
Page 1 Hortonworks © 2014
Distilling Hadoop Patterns of Use
Shaun Connolly, Hortonworks
@shaunconnolly
March 25, 2014
Page 2 Hortonworks © 2014
Our Mission:
Enable your Modern Data Architecture by delivering Enterprise Apache Hadoop
Our Commitment
Open Leadership
Drive innovation in the open exclusively via the Apache community-driven open source process
Enterprise Rigor
Engineer, test and certify Apache Hadoop with the enterprise in mind
Ecosystem Endorsement
Focus on deep integration with existing data center technologies and skills
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Reseller Partners
Page 3 Hortonworks © 2014
Data Continues to Grow Sharply
2020: Digital universe = 40 Zettabytes
2012: Digital universe = 20 Zettabytes
1 Zettabyte (ZB) = 1 billion Terabytes (TB)
2014: 31% of enterprises managing more than 1 Petabyte
Social Networks
Machine Generated
Documents, Emails
OLTP, ERP, CRM Systems
Geolocation Data
Sensor Data
Web Logs, Click Streams
85% of growth from new types of data, with machine-generated data increasing 15x
Sources: IDC and IDG Enterprise
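The unit conversion on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, which the slide does not state, and the function name is illustrative only.

```python
# Assumption: decimal (SI) units, where 1 TB = 10**12 bytes and 1 ZB = 10**21 bytes.
BYTES_PER_TB = 10**12
BYTES_PER_ZB = 10**21


def zettabytes_to_terabytes(zb: int) -> int:
    """Convert zettabytes to terabytes using decimal units."""
    return zb * BYTES_PER_ZB // BYTES_PER_TB


# 1 ZB expressed in TB: one billion, matching the slide.
print(zettabytes_to_terabytes(1))
# The 2020 projection of 40 ZB, expressed in TB.
print(zettabytes_to_terabytes(40))
```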
  
Page 4 Hortonworks © 2014
Cameras and microphones widely deployed
New routes to market via intelligent objects
Content and services via connected products
Everything has a URL
Remote sensing of objects and environment
Augmented reality
Situational decision support
Building and infrastructure management
Over 50% of Internet connections are things:
2011: 15+ billion permanent, 50+ billion intermittent
2020: 30+ billion permanent, >200 billion intermittent
Source: Gartner Keynote at Hadoop Summit 2013
Page 5 Hortonworks © 2014
Harnessing Big Data is transformational to business models
Enables the move from post-transaction, reactive analysis of subsets of data stored in silos to a world of pre-transaction, interactive insights across all data that impacts both the top and bottom lines
Page 6 Hortonworks © 2014
Modern Data Architecture with Hadoop
APPLICATIONS: Statistical Analysis; BI / Reporting, Ad Hoc Analysis; Interactive Web & Mobile Applications; Enterprise Applications
DATA SYSTEMS: Repositories (RDBMS, EDW, MPP); ENTERPRISE HADOOP (Governance & Integration, Security, Operations, Data Access, Data Management)
SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data
OPERATIONS TOOLS: Provision, Manage & Monitor
DEV & DATA TOOLS: Build & Test
Page 7 Hortonworks © 2014
MDA Unlocks New Approach to Insight
Current Approach: apply schema on write; dependent on IT
SQL: Single Query Engine
Repeatable Linear Process: determine list of questions; design solutions; collect structured data; ask questions from list; detect additional questions
Augment with Hadoop: apply schema on read; support range of access patterns to data stored in HDFS
Enterprise Hadoop: Multiple Query Engines
Batch, Interactive, Real-time, Streaming
Iterative Process: Explore, Transform, Analyze
Page 8 Hortonworks © 2014
Schema-on-Write vs. Schema-on-Read
Standard Digital Camera
- Zoom & focus first
- Capture limited set of pixels
- Crop around the focused area
Lytro Lightfield Camera
- Capture entire lightfield
- Infinite zoom & focus
- Crop any captured areas
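The camera analogy can be made concrete with a small sketch. The event data, field names, and both helper functions below are hypothetical and chosen only to illustrate the contrast. Schema-on-write fixes the columns at load time and discards the rest, while schema-on-read keeps the raw records and applies a projection when a question is asked.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120, "device": "mobile"}',
    '{"user": "u2", "page": "/cart", "ms": 340}',
    '{"user": "u1", "page": "/cart", "ms": 95, "referrer": "email"}',
]


def schema_on_write(lines, columns):
    """Schema-on-write: fix the columns at load time and discard everything else."""
    return [{c: json.loads(line).get(c) for c in columns} for line in lines]


def schema_on_read(lines, projection):
    """Schema-on-read: keep the raw lines and apply a projection per query."""
    for line in lines:
        record = json.loads(line)
        yield {name: extract(record) for name, extract in projection.items()}


# Load-time decision: only user and page survive; device and referrer are lost.
warehouse_rows = schema_on_write(RAW_EVENTS, ["user", "page"])

# A later question, answered from the same raw data without reloading anything.
referrers = list(schema_on_read(RAW_EVENTS, {
    "user": lambda r: r["user"],
    "referrer": lambda r: r.get("referrer", "unknown"),
}))

print(warehouse_rows)
print(referrers)
```

In this sketch the referrer question can only be answered because the raw lines were retained, which is the "crop any captured areas" side of the analogy.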
Page 9 Hortonworks © 2014
MDA Uses Commodity Compute + Storage
Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure
[Chart: Fully Loaded Cost per Raw TB of Data (min – max cost), axis from $0 to $180,000. Categories: Cloud Storage, Hadoop, NAS, Engineered System, EDW/MPP, SAN]
Page 10 Hortonworks © 2014
MDA Optimizes Data Warehouse
Current Reality
Analytics 20%, ETL Process 30%, Operations 50%
- EDW at capacity; some usage from low value workloads
- Older transformed data archived, unavailable for ongoing exploration
- Source data often discarded
Augment with Hadoop
Operations 50%, Analytics 50%
HADOOP: Parse, cleanse, apply structure, transform
- Free up EDW resources from low value tasks
- Keep 100% of source data and historical data for ongoing exploration
- Mine data for value after loading it because of schema-on-read
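The "parse, cleanse, apply structure, transform" steps that the slide moves off the warehouse can be sketched as a small pipeline. The log format, sample lines, and function names are assumptions made for illustration; a real deployment would run equivalent logic on a cluster rather than in a single Python process.

```python
import re
from collections import Counter

# Hypothetical raw web-server log lines; the format is an assumption for illustration.
RAW_LOG = [
    '10.0.0.1 - - [25/Mar/2014:10:00:01] "GET /home HTTP/1.1" 200',
    'corrupted line without the expected fields',
    '10.0.0.2 - - [25/Mar/2014:10:00:03] "GET /cart HTTP/1.1" 500',
    '10.0.0.1 - - [25/Mar/2014:10:00:07] "GET /cart HTTP/1.1" 200',
]

LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def parse(lines):
    """Parse: match each raw line against the pattern; unmatched lines become None."""
    return [LINE_PATTERN.match(line) for line in lines]


def cleanse(matches):
    """Cleanse: drop lines that did not parse."""
    return [m for m in matches if m is not None]


def apply_structure(matches):
    """Apply structure: turn regex matches into typed records."""
    return [
        {"ip": m.group("ip"), "path": m.group("path"), "status": int(m.group("status"))}
        for m in matches
    ]


def transform(records):
    """Transform: count successful hits per path, ready to load into a warehouse."""
    return Counter(r["path"] for r in records if r["status"] == 200)


hits_per_path = transform(apply_structure(cleanse(parse(RAW_LOG))))
print(hits_per_path)
```

Only the small aggregated result would need to reach the warehouse in this sketch, while the raw lines stay available for later questions.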
Page 11 Hortonworks © 2014
Integrating with Existing Investments
APPLICATIONS: BusinessObjects BI
DATA SYSTEM: RDBMS, EDW, MPP, HANA
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
OPERATIONAL TOOLS
DEV & DATA TOOLS
INFRASTRUCTURE
  
Page 12 Hortonworks © 2014
Powering the Modern Data Architecture
Enables deep insight across a large, broad, diverse set of data at efficient scale
Multi-Use Data Platform
Store all data in one place, process in many ways
Batch, Interactive, Real-time, Streaming
YARN: Data Operating System
[Diagram: cluster nodes 1 ... n]
Data Lake that contains ALL data; raw sources and any processed data over extended periods of time.
Page 13 Hortonworks © 2014
How Hadoop?
“Hadoop can be used to create a ‘data lake’ – an integrated repository of data from internal and external data sources... Data combined from multiple silos can help your organization find answers to complex questions that no one has previously dared ask or known how to ask.”
-- Forrester
Page 14 Hortonworks © 2014
The Common Journey with Hadoop
SCALE
SCOPE
More data and analytic apps
New Analytic Apps
New types of data
LOB-driven
A Modern Data Architecture
RDBMS, MPP, EDW
Governance & Integration, Security, Operations, Data Access, Data Management
Page 15 Hortonworks © 2014
Unlock Value in New Types of Data
1. Social
Understand how people are feeling and interacting – right now
2. Clickstream
Capture and analyze website visitors’ data trails and optimize your website
3. Sensor/Machine
Discover patterns in data streaming from remote sensors and machines
4. Geographic
Analyze location-based data to manage operations where they occur
5. Server Logs
Diagnose process failures and prevent security breaches
6. Unstructured (text, video, pictures, etc.)
Understand patterns in files across millions of web pages, emails, and documents
Value
+ Online archive
Data that was once purged or moved to tape can be stored in Hadoop to discover long-term trends and previously hidden value
Page 16 Hortonworks © 2014
20 Business Applications of Hadoop
Industry | Use Case | Type of Data
Financial Services | New Account Risk Screens | Text, Server Logs
Financial Services | Trading Risk | Server Logs
Financial Services | Insurance Underwriting | Geographic, Sensor, Text
Telecom | Call Detail Records (CDRs) | Machine, Geographic
Telecom | Infrastructure Investment | Machine, Server Logs
Telecom | Real-time Bandwidth Allocation | Server Logs, Text, Social
Retail | 360° View of the Customer | Clickstream, Text
Retail | Localized, Personalized Promotions | Geographic
Retail | Website Optimization | Clickstream
Manufacturing | Supply Chain and Logistics | Sensor
Manufacturing | Assembly Line Quality Assurance | Sensor
Manufacturing | Crowdsourced Quality Assurance | Social
Healthcare | Use Genomic Data in Medical Trials | Structured
Healthcare | Monitor Patient Vitals in Real-Time | Sensor
Pharmaceuticals | Recruit and Retain Patients for Drug Trials | Social, Clickstream
Pharmaceuticals | Improve Prescription Adherence | Social, Unstructured, Geographic
Oil & Gas | Unify Exploration & Production Data | Sensor, Geographic & Unstructured
Oil & Gas | Monitor Rig Safety in Real-Time | Sensor, Unstructured
Government | ETL Offload in Response to Federal Budgetary Pressures | Structured
Government | Sentiment Analysis for Government Programs | Social
Page 17 Hortonworks © 2014
360° Customer View for Home Supply Retailer
Problem
Disjoint customer engagement across all channels
Data repositories on website traffic, POS transactions and in-home services exist in separate silos
Unable to perform analytics on customer buying behavior across all channels
Limited ability for targeted marketing to specific segments
Solution
Unified system of engagement via “golden record”
Golden record enables targeted marketing capabilities: customized coupons, promotions and emails
Deep visibility into all customers and all market segments
Unlocks rich, informed cross-sell & up-sell opportunities
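One way to picture the "golden record" is as a merge of the three siloed repositories on a shared customer key. The sketch below is illustrative only: the field names, the sample rows, and the assumption that every channel already carries the same `customer_id` are not details from the case study.

```python
from collections import defaultdict

# Hypothetical per-channel records; field names and the shared customer_id key
# are assumptions for illustration, not details from the case study.
WEB_TRAFFIC = [{"customer_id": "c1", "last_page": "/decking"}]
POS_TRANSACTIONS = [
    {"customer_id": "c1", "amount": 250.0},
    {"customer_id": "c2", "amount": 40.0},
]
HOME_SERVICES = [{"customer_id": "c2", "service": "flooring install"}]


def build_golden_records(web, pos, services):
    """Merge the siloed channel records into one record per customer."""
    golden = defaultdict(lambda: {"pages": [], "total_spend": 0.0, "services": []})
    for row in web:
        golden[row["customer_id"]]["pages"].append(row["last_page"])
    for row in pos:
        golden[row["customer_id"]]["total_spend"] += row["amount"]
    for row in services:
        golden[row["customer_id"]]["services"].append(row["service"])
    return dict(golden)


golden = build_golden_records(WEB_TRAFFIC, POS_TRANSACTIONS, HOME_SERVICES)

# Example segment: customers who browsed online and also bought in store.
cross_channel = [
    cid for cid, rec in golden.items() if rec["pages"] and rec["total_spend"] > 0
]
print(cross_channel)
```

In practice the hard part is usually matching identities across channels that do not share a key, which this sketch deliberately sidesteps.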
Creating Opportunity
Data: Clickstream, Unstructured, Structured
Retail
Major home improvement retailer
>$74B in revenue
>300K employees
>2,200 stores
Page 18 Hortonworks © 2014
Monetize Anonymous & Aggregate Banking Data
Problem
Unable to unlock valuable cross-sell banking data
Bank possesses data that indicates larger macro-economic trends, which can be monetized in secondary markets
Data sets are isolated in legacy silos controlled by LOBs
Regulations and company policies protect customer privacy
IT challenged by joining data while guaranteeing anonymity
Solution
Create cross-LOB data lake of de-identified data
Mortgage bankers, consumer bankers, credit card group and treasury bankers have access to the same cross-sell data
Single point of security & privacy for de-identification, masking, encryption, authentication and access control
Interoperability with SAS, Red Hat & Splunk
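The tension between joining data and protecting identity can be illustrated with keyed hashing and masking. This is a minimal sketch under stated assumptions, not the bank's method: the key, field names, and sample records are hypothetical, and a keyed hash gives pseudonymization rather than guaranteed anonymity.

```python
import hashlib
import hmac

# Assumption: a secret key held by the single point of security; hard-coded here
# only to keep the sketch self-contained.
SECRET_KEY = b"example-key-do-not-use-in-production"


def pseudonymize(customer_id: str) -> str:
    """Replace an identifier with a keyed hash so records can still be joined."""
    return hmac.new(SECRET_KEY, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_account(account_number: str) -> str:
    """Mask all but the last four characters of an account number."""
    return "*" * (len(account_number) - 4) + account_number[-4:]


def de_identify(record: dict) -> dict:
    """Drop the name, pseudonymize the ID, and mask the account number."""
    return {
        "customer_key": pseudonymize(record["customer_id"]),
        "account": mask_account(record["account"]),
        "balance": record["balance"],
    }


mortgage = de_identify(
    {"customer_id": "42", "name": "A. Person", "account": "1234567890", "balance": 1000}
)
card = de_identify(
    {"customer_id": "42", "name": "A. Person", "account": "9876543210", "balance": 250}
)

# The same customer yields the same key in both lines of business,
# so the records can be joined without exposing the original ID.
print(mortgage["customer_key"] == card["customer_key"])
print(mortgage["account"])
```

Applying the same function at one central point is what lets the separate lines of business share a join key in this sketch.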
Creating Opportunity
Data: Structured, Clickstream, Social & Unstructured
Banking
One of the largest US banks
Page 19 Hortonworks © 2014
Optimize High-Tech Manufacturing
Problem
Ineffective root cause analysis on product defects
200 million digital storage devices manufactured yearly
>10K faulty devices returned by customers every month
Limited data available for root cause analysis means that diagnosing problems is highly manual (physical inspections)
Subset of sensor data from QA testing retained 3-12 months
Solution
Created sensor data lake for 10x quality improvement
Repository holds 24 months of data for each device
Manufacturing dashboard allows >1,000 employees to search data, with results returned in less than 1 second
Quality improved 10x: rate down to ~1K faulty devices / month
Improving Efficiency
Data: Sensor
Manufacturing
Digital Storage Devices
>$15B in revenue
>85K employees
Page 20 Hortonworks © 2014
Think Pigabyte, Not Petabyte
Page 21 Hortonworks © 2014
Enabling Hadoop for the Enterprise Journey
1. Capabilities
Ensure enterprise capabilities are delivered in 100% open source to benefit all
2. Integration
Interoperable with existing data center investments
3. Skills
Leverage your existing skills: development, analytics, operations
Scale
Scope
More data and analytic apps
New Analytic Apps
New types of data
LOB-driven
A Modern Data Architecture
RDBMS, MPP, EDW
Governance & Integration, Security, Operations, Data Access, Data Management
Page 22 Hortonworks © 2014
Try Hadoop Today… Get Involved
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
San Jose, CA: June 3 - 5, 2014
Amsterdam: April 2 - 3, 2014
Page 23 Hortonworks © 2014
Questions?
@shaunconnolly

Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

  • 1. Page 1 Hortonworks © 2014 Distilling Hadoop Patterns of Use Shaun Connolly, Hortonworks @shaunconnolly March 25, 2014
  • 2. Page 2 Hortonworks © 2014 Our Mission: Our Commitment Open Leadership Drive innovation in the open exclusively via the Apache community-driven open source process Enterprise Rigor Engineer, test and certify Apache Hadoop with the enterprise in mind Ecosystem Endorsement Focus on deep integration with existing data center technologies and skills Headquarters: Palo Alto, CA Employees: 300+ and growing Reseller Partners Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop
  • 3. Page 3 Hortonworks © 2014 Data Continues to Grow Sharply 2020:   Digital  universe  =  40  Ze'abytes     2012:   Digital  universe  =  20  Ze'abytes   1  Ze2abyte  (ZB)  =  1  billion  Terabytes  (TB)     2014:   31%  of  enterprises  managing  more  than  1  Petabyte   Social   Networks   Machine   Generated   Documents,     Emails   OLTP,  ERP,     CRM  Systems   Geoloca@on   Data   Sensor   Data   Web  Logs,   Click  Streams   85%  of  growth  from  new  types  of   data  with  machine-­‐generated   data  increasing  15x   Sources:  IDC  and  IDG  Enterprise  
  • 4. Page 4 Hortonworks © 2014 Cameras and microphones widely deployed New routes to market via intelligent objects Content and services via connected products Everything has a URL Remote sensing of objects and environment Augmented reality Situational decision support Building and infrastructure management Over 50% of Internet connections are things: 2011: 15+ billion permanent, 50+ billion intermittent 2020: 30+ billion permanent, >200 billion intermittent Source: Gartner Keynote at Hadoop Summit 2013
  • 5. Page 5 Hortonworks © 2014 Harnessing Big Data is transformational to business models Enables the move from post-transaction, reactive analysis of subsets of data stored in silos to a world of pre-transaction, interactive insights across all data that impacts both the top and bottom lines
  • 6. Page 6 Hortonworks © 2014 DATA  SYSTEMS  APPLICATIONS   Repositories   ROOMS Sta@s@cal   Analysis   BI  /  Repor@ng,   Ad  Hoc  Analysis   Interac@ve  Web   &  Mobile  Applica@ons   Enterprise   Applica@ons   EDW MPPRDBMS   EDW   MPP   Governance     &  Integra=on   Security   Opera=ons   Data  Access   Data  Management   SOURCES   OLTP,  ERP,   CRM  Systems   Documents,     Emails   Web  Logs,   Click  Streams   Social   Networks   Machine   Generated   Sensor   Data   Geoloca@on   Data   Modern Data Architecture with Hadoop OPERATIONS  TOOLS   Provision, Manage & Monitor DEV  &  DATA  TOOLS   Build & Test ENTERPRISE HADOOP
  • 7. Page 7 Hortonworks © 2014 MDA Unlocks New Approach to Insight Enterprise  Hadoop   Mul@ple  Query  Engines   Itera@ve  Process:  Explore,  Transform,  Analyze   SQL   Single  Query  Engine   Repeatable  Linear  Process   Determine   list  of   ques@ons   Current  Approach     Apply  schema  on  write     Dependent  on  IT   Augment  with  Hadoop     Apply  schema  on  read     Support  range  of  access  paRerns  to  data  stored  in  HDFS   Design   solu@ons   Collect   structured   data   Ask   ques@ons   from  list   Detect   addi@onal   ques@ons   Batch   Interac@ve   Real-­‐@me   Streaming  
  • 8. Page 8 Hortonworks © 2014 Schema-on-Write vs. Schema-on-Read Standard Digital Camera § Zoom & focus first § Capture limited set of pixels § Crop around the focused area Lytro Lightfield Camera § Capture entire lightfield § Infinite zoom & focus § Crop any captured areas
  • 9. Page 9 Hortonworks © 2014 MDA Uses Commodity Compute + Storage $0 $20,000 $40,000 $60,000 $80,000 $180,000 Cloud Storage HADOOP NAS Engineered System Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure Fully Loaded Cost per Raw TB of Data (min – max cost) EDW/MPP SAN
  • 10. Page 10 Hortonworks © 2014 MDA Optimizes Data Warehouse Analytics 20% ETL Process 30% Operations 50% Current Reality §  EDW at capacity; some usage from low value workloads §  Older transformed data archived, unavailable for ongoing exploration §  Source data often discarded Operations 50% Analytics 50% HADOOP Parse, cleanse, apply structure, transform Augment with Hadoop §  Free up EDW resources from low value tasks §  Keep 100% of source data and historical data for ongoing exploration §  Mine data for value after loading it because of schema-on-read
  • 11. Page 11 Hortonworks © 2014 Integrating with Existing InvestmentsAPPLICATIONS  DATA  SYSTEM  SOURCES   RDBMS   EDW   MPP   Emerging  Sources     (Sensor,  Sen=ment,  Geo,  Unstructured)   HANA BusinessObjects BI OPERATIONAL  TOOLS   DEV  &  DATA  TOOLS   Exis=ng  Sources     (CRM,  ERP,  Clickstream,  Logs)   INFRASTRUCTURE  
  • 12. Powering the Modern Data Architecture
      Enables deep insight across a large, broad, diverse set of data at efficient scale
      Multi-Use Data Platform: store all data in one place, process in many ways
        - Batch, Interactive, Real-time, Streaming
        - YARN: Data Operating System
        - [Diagram: cluster nodes 1 through n]
      Data Lake that contains ALL data: raw sources and any processed data over extended periods of time
  • 13. How Hadoop?
      “Hadoop can be used to create a ‘data lake’ – an integrated repository of data from internal and external data sources... Data combined from multiple silos can help your organization find answers to complex questions that no one has previously dared ask or known how to ask.”
      Source: Forrester
  • 14. The Common Journey with Hadoop
      [Diagram: journey plotted on Scale and Scope axes]
        - Labels: More data and analytic apps; New Analytic Apps; New types of data; LOB-driven; A Modern Data Architecture
        - Systems shown: RDBMS, MPP, EDW
        - Capabilities shown: Governance & Integration, Security, Operations, Data Access, Data Management
  • 15. Unlock Value in New Types of Data
      1. Social: Understand how people are feeling and interacting, right now
      2. Clickstream: Capture and analyze website visitors’ data trails and optimize your website
      3. Sensor/Machine: Discover patterns in data streaming from remote sensors and machines
      4. Geographic: Analyze location-based data to manage operations where they occur
      5. Server Logs: Diagnose process failures and prevent security breaches
      6. Unstructured (txt, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents
      Value + Online archive: Data that was once purged or moved to tape can be stored in Hadoop to discover long-term trends and previously hidden value
  • 16. 20 Business Applications of Hadoop
      Industry | Use Case | Type of Data
      Financial Services | New Account Risk Screens | Text, Server Logs
      Financial Services | Trading Risk | Server Logs
      Financial Services | Insurance Underwriting | Geographic, Sensor, Text
      Telecom | Call Detail Records (CDRs) | Machine, Geographic
      Telecom | Infrastructure Investment | Machine, Server Logs
      Telecom | Real-time Bandwidth Allocation | Server Logs, Text, Social
      Retail | 360° View of the Customer | Clickstream, Text
      Retail | Localized, Personalized Promotions | Geographic
      Retail | Website Optimization | Clickstream
      Manufacturing | Supply Chain and Logistics | Sensor
      Manufacturing | Assembly Line Quality Assurance | Sensor
      Manufacturing | Crowdsourced Quality Assurance | Social
      Healthcare | Use Genomic Data in Medical Trials | Structured
      Healthcare | Monitor Patient Vitals in Real-Time | Sensor
      Pharmaceuticals | Recruit and Retain Patients for Drug Trials | Social, Clickstream
      Pharmaceuticals | Improve Prescription Adherence | Social, Unstructured, Geographic
      Oil & Gas | Unify Exploration & Production Data | Sensor, Geographic & Unstructured
      Oil & Gas | Monitor Rig Safety in Real-Time | Sensor, Unstructured
      Government | ETL Offload in Response to Federal Budgetary Pressures | Structured
      Government | Sentiment Analysis for Government Programs | Social
  • 17. 360° Customer View for Home Supply Retailer
      Creating Opportunity
      Data: Clickstream, Unstructured, Structured
      Retail: major home improvement retailer; >$74B in revenue; >300K employees; >2,200 stores
      Problem: Disjoint customer engagement across all channels
        - Data repositories on website traffic, POS transactions and in-home services exist in separate silos
        - Unable to perform analytics on customer buying behavior across all channels
        - Limited ability for targeted marketing to specific segments
      Solution: Unified system of engagement via “golden record”
        - Golden record enables targeted marketing capabilities: customized coupons, promotions and emails
        - Deep visibility into all customers and all market segments
        - Unlocks rich, informed cross-sell & up-sell opportunities
  • 18. Monetize Anonymous & Aggregate Banking Data
      Creating Opportunity
      Data: Structured, Clickstream, Social & Unstructured
      Banking: one of the largest US banks
      Problem: Unable to unlock valuable cross-sell banking data
        - Bank possesses data that indicates larger macro-economic trends, which can be monetized in secondary markets
        - Data sets are isolated in legacy silos controlled by LOBs
        - Regulations and company policies protect customer privacy
        - IT challenged by joining data while guaranteeing anonymity
      Solution: Create cross-LOB data lake of de-identified data
        - Mortgage bankers, consumer bankers, credit card group and treasury bankers have access to the same cross-sell data
        - Single point of security & privacy for de-identification, masking, encryption, authentication and access control
        - Interoperability with SAS, Red Hat & Splunk
  • 19. Optimize High-Tech Manufacturing
      Improving Efficiency
      Data: Sensor
      Manufacturing: digital storage devices; >$15B in revenue; >85K employees
      Problem: Ineffective root cause analysis on product defects
        - 200 million digital storage devices manufactured yearly
        - >10K faulty devices returned by customers every month
        - Limited data available for root cause analysis means that diagnosing problems is highly manual (physical inspections)
        - Subset of sensor data from QA testing retained 3-12 months
      Solution: Created sensor data lake for 10x quality improvement
        - Repository holds 24 months of data for each device
        - Manufacturing dashboard allows >1,000 employees to search data, with results returned in less than 1 second
        - Quality improved 10x: rate down to ~1K faulty devices / month
  • 20. Think Pigabyte, Not Petabyte
  • 21. Enabling Hadoop for the Enterprise Journey
      1. Capabilities: Ensure enterprise capabilities are delivered in 100% open source to benefit all
      2. Integration: Interoperable with existing data center investments
      3. Skills: Leverage your existing skills: development, analytics, operations
      [Diagram: journey plotted on Scale and Scope axes]
        - Labels: More data and analytic apps; New Analytic Apps; New types of data; LOB-driven; A Modern Data Architecture
        - Systems shown: RDBMS, MPP, EDW
        - Capabilities shown: Governance & Integration, Security, Operations, Data Access, Data Management
  • 22. Try Hadoop Today…
      Download the Hortonworks Sandbox
        - Learn Hadoop
        - Build Your Analytic App
        - Try Hadoop 2
      Get Involved
        - San Jose, CA: June 3-5, 2014
        - Amsterdam: April 2-3, 2014
  • 23. Questions? @shaunconnolly
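
The schema-on-write versus schema-on-read contrast from slides 7, 8 and 10 can be illustrated with a minimal sketch. Nothing here comes from the deck: the event records, the field names, and the helpers `load_schema_on_write` and `read_with_schema` are hypothetical, and plain Python stands in for a warehouse loader and a Hadoop query engine.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (one JSON object per line).
RAW_EVENTS = """\
{"user": "u1", "page": "/home", "ms": 120, "device": "mobile"}
{"user": "u2", "page": "/cart", "ms": 340}
{"user": "u1", "page": "/cart", "ms": 95, "device": "desktop", "referrer": "ad-7"}
"""

# Schema-on-write: the columns are fixed before any data is loaded.
WRITE_SCHEMA = ("user", "page")


def load_schema_on_write(raw):
    """Keep only the columns chosen up front; every other field is dropped at load time."""
    records = (json.loads(line) for line in raw.splitlines())
    return [{key: rec.get(key) for key in WRITE_SCHEMA} for rec in records]


def read_with_schema(raw, fields):
    """Schema-on-read: the raw data stays whole and a schema is applied per query."""
    records = (json.loads(line) for line in raw.splitlines())
    return [{key: rec.get(key) for key in fields} for rec in records]


# The warehouse copy can only answer questions about the columns it was loaded with.
warehouse = load_schema_on_write(RAW_EVENTS)

# A question nobody listed at load time can still be asked of the raw data.
late_question = read_with_schema(RAW_EVENTS, ("user", "device"))

print(warehouse[0])
print(late_question)
```

The two helpers apply a schema the same way; what differs is when it happens. In the schema-on-write path only `warehouse` survives the load, so the `device` field is gone, while the schema-on-read path keeps `RAW_EVENTS` intact and chooses the fields at query time.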