Supporting Financial Services
With a More Flexible Approach
to Big Data
October 21, 2014
WWW.WANDISCO.COM
REALIZING THE POSSIBILITIES OF BIG DATA
Our Presenters
Justin Sears is a Product Marketing Manager at Hortonworks, where he writes stories about how enterprise customers use Apache Hadoop to solve big data business challenges. He also manages product launch marketing and campaign content for Hortonworks. For seventeen years, Justin has led teams in Silicon Valley to create and position enterprise software, risk-controlled consumer banking products, desktop and mobile web properties, and services for Latino customers in the US and Latin America. He lives with his family in his native San Francisco Bay Area.

Brett Rudenstein has an extensive background in Application Lifecycle Management, High Performance Computing and Open Source Software Analysis. He has held senior sales engineering and management positions at Rational Software, PureAtria, IBM, Appistry and Palamida. Throughout his career, he has enabled organizations to accelerate technology adoption by understanding their needs and providing just-in-time business solutions. As WANdisco Director of Product Management for Big Data, Brett works with partners, prospects and customers to help them understand and evolve the requirements for enterprise-ready Hadoop.
Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hortonworks
We Do Hadoop
Our Mission:
Power your Modern Data Architecture
with HDP and Enterprise Apache Hadoop
Who we are
June 2011: Original 24 architects, developers, operators of Hadoop from Yahoo!
June 2014: An enterprise software company with 420+ Employees
Key Partners
Our model
Innovate and deliver Apache Hadoop as a complete enterprise data platform
completely in the open, backed by a world class support organization
Fastest growing Fortune 1000 customer base
Customer Momentum
•  300+ customers in seven quarters, growing at 75+/quarter
•  Two thirds of customers come from F1000
•  100% renewal rate
Largest Cluster in North America
32,000 Nodes
Largest Cluster in Europe
1,000 Nodes
Some notable migrations include many of the early adopters of Hadoop:
Experience at Scale
80,000 nodes under contract
Largest Known Cluster in APAC
400 Nodes
30+ customers migrated from other distributions
Hortonworks: A Leader In Hadoop
The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014
“Hortonworks loves and lives
open source innovation”
Vision & Execution for Enterprise Hadoop.
Hortonworks leads with a strong strategy and roadmap for open source innovation
with Hadoop and a strong delivery of that innovation in Hortonworks Data Platform.
World Class Support and Services.
Hortonworks' Customer Support received a maximum score
and was significantly higher than both Cloudera and MapR.
Key Strategic Partnerships.
Hortonworks’ unique strategic partnerships with Microsoft, SAP, Teradata and others
are a key strength as part of its overall strategy of ecosystem partnership to
accelerate Hadoop adoption in the enterprise.
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
HDP
•  Reliable
•  Consistent
•  Current
Enabling a Modern Data Architecture
with HDP and Apache Hadoop
Hortonworks. We do Hadoop.
Traditional systems under pressure
•  Silos of Data
•  Costly to Scale
•  Constrained Schemas
…and difficult to manage new data
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP
SOURCES: Existing Sources (CRM, ERP,…)
New Data Types: Clickstream; Geolocation; Sentiment, Web Data; Sensor, Machine Data; Unstructured docs, emails; Server logs
Traditional Hadoop, challenges & limitations
Architectural Limitations
•  Single-purpose clusters, specific data sets
•  Primarily a batch system using MapReduce
Enterprise Challenges
•  Limited enterprise capabilities: Operations, Security & Governance
•  Created additional Silos
Interoperability Challenges
•  Difficult to natively integrate existing applications
Commercial add-ons opportunistically emerged in the early days to address these shortcomings
Diagram: nodes 1…N running MapReduce (Largely Batch Processing) on HDFS (Hadoop Distributed File System)
SOURCES: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP
Hadoop 2 & YARN based Architecture
Hadoop w/ MapReduce
Siloed clusters
Largely batch system
Difficult to integrate
MR-279: YARN
Hadoop 2 & YARN
Architected & led development of YARN to enable the Modern Data Architecture
Timeline markers: 2006, 2009, October 23, 2013
Diagram: Hadoop w/ MapReduce (nodes 1…N; MapReduce, Largely Batch Processing; HDFS, Hadoop Distributed File System) next to Hadoop 2 (nodes 1…N; YARN: Data Operating System; HDFS, Hadoop Distributed File System; Batch, Interactive, Real-Time)
HDP2 and YARN enable the Modern Data Architecture
Hortonworks architected and led development of YARN
Common data set, multiple applications
•  Optionally land all data in a single cluster
•  Batch, interactive & real-time use cases
•  Support multi-tenant access, processing & segmentation of data
YARN: Architectural center of Hadoop
•  Consistent security, governance & operations
•  Ecosystem applications certified by Hortonworks to run natively in Hadoop
SOURCES: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP
Diagram: nodes 1…N; YARN: Data Operating System; HDFS (Hadoop Distributed File System); Batch, Interactive, Real-Time
A Blueprint for Enterprise Hadoop
GOVERNANCE & INTEGRATION: Load data and manage according to policy
DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
DATA MANAGEMENT: Store and process all of your Corporate Data Assets
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS: Deploy and effectively manage the platform
PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud
YARN Data Operating System
Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
YARN is the architectural center of HDP
•  Common data set across all applications
•  Batch, interactive & real-time workloads
•  Multi-tenant access & processing
Provides comprehensive enterprise capabilities
•  Governance
•  Security
•  Operations
Enables broad ecosystem adoption
•  ISVs can plug directly into Hadoop
The widest range of deployment options
•  Linux & Windows
•  On-premises & cloud
Diagram: HDP 2.2 components
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java, Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines); Tez and Slider also shown
GOVERNANCE: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
SECURITY: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox)
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
Deployment Choice: Linux, Windows, On-Premises, Cloud
The Modern Data Architecture w/ HDP
Hadoop Value: New Types of Data
Clickstream: Capture and analyze website visitors’ data trails and optimize your website
Sensors: Discover patterns in data streaming automatically from remote sensors and machines
Server Logs: Research logs to diagnose process failures and prevent security breaches
Sentiment: Understand how your customers feel about your brand and products – right now
Geographic: Analyze location-based data to manage operations where they occur
Unstructured: Understand patterns in files across millions of web pages, emails, and documents
New analytic applications for new types of data
Financial Services
•  New Account Risk Screens
•  Fraud Prevention
•  Trading Risk
•  Maximize Deposit Spread
•  Insurance Underwriting
•  Accelerate Loan Processing
Retail
•  360° View of the Customer
•  Analyze Brand Sentiment
•  Localized, Personalized Promotions
•  Website Optimization
•  Optimal Store Layout
Telecom
•  Call Detail Records (CDRs)
•  Infrastructure Investment
•  Next Product to Buy (NPTB)
•  Real-time Bandwidth Allocation
•  New Product Development
Manufacturing
•  Supplier Consolidation
•  Supply Chain and Logistics
•  Assembly Line Quality Assurance
•  Proactive Maintenance
•  Crowdsourced Quality Assurance
Healthcare
•  Genomic data for medical trials
•  Monitor patient vitals
•  Reduce re-admittance rates
•  Store medical research data
•  Recruit cohorts for pharmaceutical trials
Utilities, Oil & Gas
•  Smart meter stream analysis
•  Slow oil well decline curves
•  Optimize lease bidding
•  Compliance reporting
•  Proactive equipment repair
•  Seismic image processing
Public Sector
•  Analyze public sentiment
•  Protect critical networks
•  Prevent fraud and waste
•  Crowdsource reporting for repairs to infrastructure
•  Fulfill open records requests
…to shift from reactive to proactive interactions
HDP and Hadoop allow organizations to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision)
A shift in Advertising: From mass branding …to 1x1 Targeting
A shift in Financial Services: From Educated Investing …to Automated Algorithms
A shift in Healthcare: From mass treatment …to Designer Medicine
A shift in Retail: From static branding …to Real-time Personalization
A shift in Telco: From break then fix …to repair before break
Data Lake: An architectural shift
Unlocking the Data Lake
Data Lake Enabled by YARN
•  Single data repository, shared infrastructure
•  Multiple biz apps accessing all the data
•  Enable a shift from reactive to proactive interactions
•  Gain new insight across the entire enterprise
Diagram: axes SCALE and SCOPE; RDBMS, MPP and EDW alongside the Data Lake; New Analytic Apps or IT Optimization
HDP 2.1: Governance & Integration, Security, Operations, Data Access, Data Management, YARN
HDP is deeply integrated in the data center
Deep Partnerships
Hortonworks engages in deep engineered relationships with the leaders in the data center, such as Microsoft, Teradata, Red Hat, HP, SAS & SAP
Broad Partnerships
Over 600 partners work with us to certify their applications to work with Hadoop so they can extend big data to their users
Diagram: OPERATIONAL TOOLS; DEV & DATA TOOLS; INFRASTRUCTURE
SOURCES: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
DATA SYSTEM: RDBMS, EDW, MPP, HANA
APPLICATIONS: BusinessObjects BI
HDP 2.1: Governance & Integration, Security, Operations, Data Access, Data Management, YARN
HDP Use Cases in Financial Services
Hortonworks. We do Hadoop.
Monetize Anonymous & Aggregate Banking Data
Problem
Valuable banking data needed to be anonymous & unified
•  Bank possesses data that indicates larger macro-economic trends, which can be
monetized in secondary markets
•  Regulations and company policies protect customer privacy
•  Data sets are isolated in legacy silos controlled by LOBs
•  IT challenged by joining data while guaranteeing anonymity
Solution
Cross-bank data lake for aggregate data with secure access
•  Multiple data sets abstracted from source platforms
•  Single point of security & privacy for de-identification, masking, encryption,
authentication and access control
•  Mortgage bankers, consumer bankers, credit card group and treasury bankers have
access to the same cross-sell data
•  Interoperability with partners SAS, R, Red Hat & Splunk
•  Economies of scale for compression & archiving data
•  Significant reduction in storage costs from prior platforms
Creating Opportunity
Data: Structured,
Clickstream, Social &
Unstructured
Banking
One of the largest US banks
Insurance Data Lake to Manage Risk
Problem
Challenges merging new & old data hamper analysis
•  Traditional and newer types of data were both growing quickly but were difficult to
combine in the EDW
•  “Schema on load” requirements of EDW platform limited ingest of some data with
significant predictive power
•  Company missed data-driven ways to serve customers
•  Process of separating legitimate from fraudulent claims created a “needle-in-a-haystack” problem
Solution
Common platform for all types of data improves up-sell and reduces fraud
•  “Schema on read” Hadoop architecture means that more data sources can be
easily ingested to enrich predictive analytics
•  Agents use big data insights to determine the best actions for valued customers and
recommend them in real time
•  Claims analysts and underwriters process streaming data to quickly flag fraud risks
and fast-track legitimate claims
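The “schema on read” point above can be illustrated with a minimal Python sketch. It is not the insurer's implementation: the field names (claim_id, amount, geo), the sample records and the read_with_schema helper are all hypothetical. The idea shown is that raw records are stored as they arrive, and each reader applies its own schema when it queries them.

```python
import json

# Hypothetical raw claim events, stored exactly as they arrived (no upfront schema).
RAW_LINES = [
    '{"claim_id": "c1", "amount": "1200.50", "channel": "web"}',
    '{"claim_id": "c2", "amount": "87", "device": "mobile", "geo": "CA"}',
    'not valid json',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    `schema` maps field name -> cast function. Fields missing from a record
    come back as None; lines that are not JSON objects are skipped.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the malformed row stays in storage; this reader ignores it
        if not isinstance(record, dict):
            continue
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        yield row


# Two readers project different schemas over the same stored data.
fraud_view = list(read_with_schema(RAW_LINES, {"claim_id": str, "amount": float}))
geo_view = list(read_with_schema(RAW_LINES, {"claim_id": str, "geo": str}))
```

A new data source with extra fields can be ingested without changing either reader; only readers that want the new fields need a new schema.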
Creating Opportunity
Data: Structured,
Clickstream, Server Log
Health Insurance
Large US medical insurer
>$30B in revenue
>20M members
~35K employees
Maintaining SLAs for Equity Trading Information
Problem
Meeting 12 millisecond SLAs for “ticker plant”
•  Daily ingest: 50GB server log data from 10,000 feeds
•  Four times daily, this data is pushed into DB2
•  Applications query this data 35K times per second
•  70% of queries are for data <1 year old, 30% for >1 year old
•  Current architecture can only hold 10 years of trading data
•  Growing volume puts performance at risk of missing SLAs
Solution
Meeting SLAs with confidence
•  HBase provides super-fast queries within SLA targets
•  ETL offloading to Hadoop allows longer data retention, without jeopardizing fast
response times
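To put the figures on this slide in proportion, here is a small back-of-the-envelope calculation in Python. The inputs (50GB per day, 35K queries per second, the 70/30 split, 10 years) come from the slide. The assumptions are labeled in the code: ingest on all 365 days of the year, decimal units (1 TB = 1000 GB), and raw size with no compression or replication.

```python
# Figures taken from the slide.
DAILY_INGEST_GB = 50
QUERIES_PER_SECOND = 35_000
RECENT_SHARE_PERCENT = 70      # queries for data less than 1 year old
RETENTION_YEARS = 10

# Assumptions (not stated in the slide).
DAYS_PER_YEAR = 365            # ingest every calendar day
GB_PER_TB = 1000               # decimal units

# Query load split by data age, using integer arithmetic.
recent_qps = QUERIES_PER_SECOND * RECENT_SHARE_PERCENT // 100
older_qps = QUERIES_PER_SECOND - recent_qps

# Raw volume for the retention window, before compression or replication.
retained_tb = DAILY_INGEST_GB * DAYS_PER_YEAR * RETENTION_YEARS / GB_PER_TB

print(f"recent-data queries/s: {recent_qps}")
print(f"older-data queries/s:  {older_qps}")
print(f"raw retained volume:   {retained_tb} TB")
```

Under these assumptions, most of the query load targets the most recent year of data, which is a small fraction of the retained volume.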
Improving Efficiency
Data: Server Log & ETL
Investment
Services
Highly trafficked website
providing business and
financial information
~15K employees
Hadoop is a Platform Decision
Open Leadership
Drive innovation in the open via
the Apache community-driven
open source process
Enterprise Rigor
Engineer, test and certify
Apache Hadoop with the
enterprise in mind
Ecosystem Endorsement
Focus on deep integration with
existing data center technologies
and skills
Fastest Growing Customer and Partner Base
Largest and most experienced Hadoop adopters have standardized on Hortonworks
The data center leaders have standardized on Hortonworks
Supporting Financial Services with a More Flexible Approach to Big Data
WANdisco Background
•  WANdisco: Wide Area Network Distributed Computing
–  Enterprise-ready, high availability software solutions that enable globally distributed
organizations to meet today’s data challenges of secure storage, scalability and availability
•  Leader in tools for software engineers – Subversion
–  Apache Software Foundation sponsor
•  Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
•  US patent for active-active replication technology granted, November 2012
•  Global locations
–  San Ramon (CA)
–  Chengdu (China)
–  Tokyo (Japan)
–  Boston (MA)
–  Sheffield (UK)
–  Belfast (UK)
Customers
Non-Stop Hadoop
Non-Intrusive Plugin
to Hortonworks HDP
Provides Continuous Availability
In the LAN / Across the WAN
Active/Active
3 Problems For Sharing Data Across Clusters
LAN / WAN
Enterprise-Ready Hadoop
Characteristics of Mission-critical Financial Applications
•  Require Continuous Availability
–  SLAs, regulatory compliance
•  Require HDFS to be Deployed Globally
–  Share data between data centers
–  Data is consistent, not eventual
•  Ease Administrative Burden
–  Reduce operational complexity
–  Simplify disaster recovery
–  Lower RTO/RPO
•  Allow Maximum Utilization of Resources
–  Within the data center
–  Across data centers
Breaking Away from Active/Passive
What’s in a NameNode
Single Standby
•  Inefficient utilization of resource
–  Journal Nodes
–  ZooKeeper Nodes
–  Standby Node
•  Performance Bottleneck
•  Still tied to the beeper
•  Limited to LAN scope
Active / Active
•  All resources utilized
–  Only NameNode configuration
–  Scale as the cluster grows
–  All NameNodes active
•  Load balancing
•  Set resiliency (# of active NN)
•  Global Consistency
Breaking Away from Active/Passive
What’s in a Data Center
Standby Data Center
•  Idle Resource
–  Single Data Center Ingest
–  Disaster Recovery Only
•  One way synchronization
–  DistCp
•  Error Prone
–  Clusters can diverge over time
•  Difficult to scale > 2 Data Centers
–  Complexity of sharing data increases
Active / Active
•  DR Resource Available
–  Ingest at all Data Centers
–  Run Jobs in both Data Centers
•  Replication is Multi-Directional
–  active/active
•  Absolute Consistency
–  Single HDFS spans locations
•  ‘N’ Data Center support
–  Global HDFS allows appropriate
data to be shared
Use Cases
Use Case: Disaster Recovery
•  Data is as current as possible (no periodic syncs)
•  Doesn’t require monitoring and consistency checking
•  Virtually zero downtime to recover from regional data center failure
•  Regulatory compliance
Use Case: Multi-Data Center
Ingest and multi-tenant workloads
•  Ingest and analyze anywhere
•  Analyze everywhere
–  Fraud detection
–  Equity trading information
–  New business
–  Etc…
•  Backup data center(s) can be used for work
–  No idle resources
Use Case: Heterogeneous Hardware
In-memory analytics
•  Mixed Hardware Profiles
–  Memory, disk, CPU
–  Isolate memory-hungry processing (Storm/Spark) from regular jobs
•  Share data, not processing
–  Isolate lower priority (dev/test) work
The difficulty realizing the data lake…
…is that data spans the entire world
Data Reservoir
Use Cases
•  Data Marts
–  Restrict access to relevant data
–  Create quick clusters
•  Feeder Sites (Data Tributaries)
–  Ingest only
Diagram: Data Ocean, Feeder Site, Accounting Mart, Banking Mart
Regulatory Compliance
•  Basel III
–  Consistency of data
•  Data Privacy Directive
–  Data sovereignty
•  Data doesn’t leave country of origin
Diagram: Compliance, Regulation, Guidelines
Technical Comparison
Hadoop Powered by WANdisco
Multi-Data Center Hadoop Today
What's wrong with the status quo
Periodic Synchronization: DistCp
Parallel Data Ingest: Load Balancer, Streaming
Multi-Data Center Hadoop Today
Hacks currently in use
Periodic Synchronization: DistCp
•  Runs as MapReduce
•  DR data center is read-only
•  Over time, Hadoop clusters
become inconsistent
•  Manual and labor-intensive
process to reconcile differences
•  Inefficient use of the network
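The reconciliation work described above can be sketched in a few lines of Python. This is an illustration only: it does not call DistCp or any Hadoop API, and the diff_listings helper, the paths and the checksum values are hypothetical. It assumes each cluster can produce a snapshot mapping file path to checksum, and shows the comparison an operator would otherwise do by hand after two clusters drift apart.

```python
def diff_listings(primary, replica):
    """Compare two {path: checksum} snapshots and report how they diverge."""
    missing = sorted(p for p in primary if p not in replica)
    extra = sorted(p for p in replica if p not in primary)
    changed = sorted(
        p for p in primary if p in replica and primary[p] != replica[p]
    )
    return {
        "missing_on_replica": missing,
        "extra_on_replica": extra,
        "content_differs": changed,
    }


# Hypothetical snapshots taken some time after the last periodic copy.
primary = {
    "/trades/2014-10-20.log": "a1",
    "/trades/2014-10-21.log": "b2",
    "/ref/symbols.csv": "c3",
}
replica = {
    "/trades/2014-10-20.log": "a1",
    "/ref/symbols.csv": "stale",
    "/tmp/scratch": "zz",
}

report = diff_listings(primary, replica)
for kind, paths in report.items():
    print(kind, paths)
```

Every non-empty list in the report is work for an administrator, and the lists grow with the interval between synchronization runs.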
Multi-Data Center Hadoop Today
Hacks currently in use
Parallel Data Ingest: Load Balancer, Flume
•  Hiccups in either of the Hadoop
clusters cause the two file
systems to diverge
•  Potential to run out of buffer when
WAN is down
•  Requires constant attention and
sys-admin hours to keep running
•  Data created on the cluster is not
replicated
•  Streaming technologies (like Flume)
used for data redirection only
handle streaming data
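The buffer risk mentioned above comes down to simple arithmetic: how long can the ingest layer hold data while the WAN link is down? The Python sketch below works that out. The buffer size and ingest rate are hypothetical values chosen for illustration, not figures from the presentation, and buffer_runway_seconds is an illustrative helper rather than part of Flume or any other tool.

```python
def buffer_runway_seconds(buffer_bytes, ingest_bytes_per_second):
    """Seconds of WAN outage a buffer can absorb before it fills."""
    if ingest_bytes_per_second <= 0:
        raise ValueError("ingest rate must be positive")
    return buffer_bytes / ingest_bytes_per_second


# Hypothetical values: a 10 GiB buffer and a 5 MiB/s incoming stream.
BUFFER_BYTES = 10 * 1024**3
INGEST_RATE = 5 * 1024**2

runway = buffer_runway_seconds(BUFFER_BYTES, INGEST_RATE)
print(f"buffer absorbs about {runway / 60:.1f} minutes of outage")
```

Any outage longer than the computed runway means either dropped events or back-pressure on the sources, which is why this setup needs constant attention.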
Architecture of a Non-Stop Hadoop
Q&A
Question and Answer
Submit your questions using the “ASK A QUESTION” button
Thank you

More Related Content

PDF
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
PDF
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
PDF
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
PDF
Enterprise Hadoop with Hortonworks and Nimble Storage
PDF
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
PDF
Hortonworks - What's Possible with a Modern Data Architecture?
PPTX
State of the Union with Shaun Connolly
PPTX
Don't Let Security Be The 'Elephant in the Room'
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Enterprise Hadoop with Hortonworks and Nimble Storage
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Hortonworks - What's Possible with a Modern Data Architecture?
State of the Union with Shaun Connolly
Don't Let Security Be The 'Elephant in the Room'

What's hot (20)

PDF
Discover HDP 2.1: Apache Solr for Hadoop Search
PDF
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
PDF
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
PDF
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
PDF
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
PDF
Discover.hdp2.2.storm and kafka.final
PDF
Splunk-hortonworks-risk-management-oct-2014
PDF
Hp Converged Systems and Hortonworks - Webinar Slides
PDF
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
PDF
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
PDF
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop
PDF
Discover.hdp2.2.h base.final[2]
PPTX
Introduction to the Hortonworks YARN Ready Program
PDF
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
PDF
Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...
PPTX
YARN Ready: Integrating to YARN with Tez
PDF
Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3
PPTX
Hortonworks Yarn Code Walk Through January 2014
PPTX
Bigger Data For Your Budget
PDF
Apache Hadoop on the Open Cloud
Discover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Discover.hdp2.2.storm and kafka.final
Splunk-hortonworks-risk-management-oct-2014
Hp Converged Systems and Hortonworks - Webinar Slides
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop
Discover.hdp2.2.h base.final[2]
Introduction to the Hortonworks YARN Ready Program
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...
YARN Ready: Integrating to YARN with Tez
Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3
Hortonworks Yarn Code Walk Through January 2014
Bigger Data For Your Budget
Apache Hadoop on the Open Cloud
Ad

Viewers also liked (20)

PPTX
Create a Smarter Data Lake with HP Haven and Apache Hadoop
PPTX
Hadoop and WANdisco: The Future of Big Data
PDF
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
PPTX
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
PPTX
Selective Data Replication with Geographically Distributed Hadoop
PDF
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
PDF
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
PDF
Getting to What Matters: Accelerating Your Path Through the Big Data Lifecycl...
PDF
Hortonworks and Voltage Security webinar
PDF
2015 02 12 talend hortonworks webinar challenges to hadoop adoption
PDF
Hortonworks, Novetta and Noble Energy Webinar
KEY
Large scale ETL with Hadoop
PDF
How to Become an Analytics Ready Insurer - with Informatica and Hortonworks
PDF
Hadoop 2.0: YARN to Further Optimize Data Processing
PDF
Adoption de Hadoop : des Possibilités Illimitées - Hortonworks and Talend
PDF
Non-Stop Hadoop for Hortonworks
PDF
Cloudian 451-hortonworks - webinar
PDF
Predicting Customer Experience through Hadoop and Customer Behavior Graphs
PDF
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
PPTX
Boost Performance with Scala – Learn From Those Who’ve Done It!
Create a Smarter Data Lake with HP Haven and Apache Hadoop
Hadoop and WANdisco: The Future of Big Data
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Selective Data Replication with Geographically Distributed Hadoop
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
Getting to What Matters: Accelerating Your Path Through the Big Data Lifecycl...
Hortonworks and Voltage Security webinar
2015 02 12 talend hortonworks webinar challenges to hadoop adoption
Hortonworks, Novetta and Noble Energy Webinar
Large scale ETL with Hadoop
How to Become an Analytics Ready Insurer - with Informatica and Hortonworks
Hadoop 2.0: YARN to Further Optimize Data Processing
Adoption de Hadoop : des Possibilités Illimitées - Hortonworks and Talend
Non-Stop Hadoop for Hortonworks
Cloudian 451-hortonworks - webinar
Predicting Customer Experience through Hadoop and Customer Behavior Graphs
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Boost Performance with Scala – Learn From Those Who’ve Done It!
Ad

Similar to Supporting Financial Services with a More Flexible Approach to Big Data (20)

PPTX
Supporting Financial Services with a More Flexible Approach to Big Data
PDF
Discover hdp 2.2 hdfs - final
PDF
Discover HDP 2.2: Comprehensive Hadoop Security with Apache Ranger and Apache...
PDF
Hortonworks & Bilot Data Driven Transformations with Hadoop
PDF
Hortonworks and Platfora in Financial Services - Webinar
PDF
Discover.hdp2.2.ambari.final[1]
PDF
Meetup oslo hortonworks HDP
PDF
Hortonworks Hadoop @ Oslo Hadoop User Group
PDF
Introduction to Hadoop
PDF
YARN - Strata 2014
PPTX
Realtime Analytics in Hadoop
PPTX
Realtime analytics + hadoop 2.0
PDF
Azure Cafe Marketplace with Hortonworks March 31 2016
PPTX
Hadoop In Action
PDF
Eliminating the Challenges of Big Data Management Inside Hadoop
PDF
Eliminating the Challenges of Big Data Management Inside Hadoop
PDF
How YARN Enables Multiple Data Processing Engines in Hadoop
PDF
Storm Demo Talk - Colorado Springs May 2015
PDF
Solving Big Data Problems using Hortonworks
PDF
IoT Crash Course Hadoop Summit SJ
Supporting Financial Services with a More Flexible Approach to Big Data
Discover hdp 2.2 hdfs - final
Discover HDP 2.2: Comprehensive Hadoop Security with Apache Ranger and Apache...
Hortonworks & Bilot Data Driven Transformations with Hadoop
Hortonworks and Platfora in Financial Services - Webinar
Discover.hdp2.2.ambari.final[1]
Meetup oslo hortonworks HDP
Hortonworks Hadoop @ Oslo Hadoop User Group
Introduction to Hadoop
YARN - Strata 2014
Realtime Analytics in Hadoop
Realtime analytics + hadoop 2.0
Azure Cafe Marketplace with Hortonworks March 31 2016
Hadoop In Action
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
How YARN Enables Multiple Data Processing Engines in Hadoop
Storm Demo Talk - Colorado Springs May 2015
Solving Big Data Problems using Hortonworks
IoT Crash Course Hadoop Summit SJ

Supporting Financial Services with a More Flexible Approach to Big Data

  • 1. Supporting Financial Services With a More Flexible Approach to Big Data October 21, 2014
  • 2. WWW.WANDISCO.COM: REALIZING THE POSSIBILITIES OF BIG DATA. Our Presenters. Justin Sears is a Product Marketing Manager at Hortonworks, where he writes stories about how enterprise customers use Apache Hadoop to solve big data business challenges. He also manages product launch marketing and campaign content for Hortonworks. For seventeen years, Justin has led teams in Silicon Valley to create and position enterprise software, risk-controlled consumer banking products, desktop and mobile web properties, and services for Latino customers in the US and Latin America. He lives with his family in his native San Francisco Bay Area. Brett Rudenstein has an extensive background in Application Lifecycle Management, High Performance Computing and Open Source Software Analysis. He has held senior sales engineering and management positions at Rational Software, PureAtria, IBM, Appistry and Palamida. Throughout his career, he has enabled organizations to accelerate technology adoption by understanding their needs and providing just-in-time business solutions. As WANdisco Director of Product Management for Big Data, Brett works with partners, prospects and customers to help them understand and evolve the requirements for enterprise-ready Hadoop.
  • 3. Hortonworks: We Do Hadoop. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 4. Our Mission: Power your Modern Data Architecture with HDP and Enterprise Apache Hadoop. Who we are: June 2011: original 24 architects, developers and operators of Hadoop from Yahoo!; June 2014: an enterprise software company with 420+ employees. Key Partners. Our model: innovate and deliver Apache Hadoop as a complete enterprise data platform completely in the open, backed by a world-class support organization.
  • 5. Fastest growing Fortune 1000 customer base. Customer Momentum: 300+ customers in seven quarters, growing at 75+/quarter; two thirds of customers come from F1000; 100% renewal rate. Experience at Scale: 80,000 nodes under contract; largest cluster in North America: 32,000 nodes; largest cluster in Europe: 1,000 nodes; largest known cluster in APAC: 400 nodes. 30+ customers migrated from other distributions. Some notable migrations include many of the early adopters of Hadoop.
  • 6. Hortonworks: A Leader In Hadoop. The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014: “Hortonworks loves and lives open source innovation.” Vision & Execution for Enterprise Hadoop: Hortonworks leads with a strong strategy and roadmap for open source innovation with Hadoop and a strong delivery of that innovation in Hortonworks Data Platform. World Class Support and Services: Hortonworks' Customer Support received a maximum score and was significantly higher than both Cloudera and MapR. Key Strategic Partnerships: Hortonworks’ unique strategic partnerships with Microsoft, SAP, Teradata and others are a key strength as part of its overall strategy of ecosystem partnership to accelerate Hadoop adoption in the enterprise.
  • 7. HDP IS Apache Hadoop. There is ONE Enterprise Hadoop: everything else is a vendor derivation. HDP: Reliable, Consistent, Current.
  • 8. Enabling a Modern Data Architecture with HDP and Apache Hadoop. Hortonworks. We do Hadoop.
  • 9. Traditional systems under pressure… and difficult to manage new data. Pressures: silos of data; costly to scale; constrained schemas. Applications: Business Analytics, Custom Applications, Packaged Applications. Data system: RDBMS, EDW, MPP. Sources: existing sources (CRM, ERP, …) and new data types (clickstream; geolocation; sentiment, web data; sensor, machine data; unstructured docs, emails; server logs).
  • 10. Traditional Hadoop, challenges & limitations. Diagram labels: nodes 1 to N; HDFS (Hadoop Distributed File System); MapReduce (largely batch processing); sources (Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured); applications (Business Analytics, Custom Applications, Packaged Applications); data system (RDBMS, EDW, MPP). Architectural Limitations: single-purpose clusters, specific data sets; primarily a batch system using MapReduce. Enterprise Challenges: limited enterprise capabilities (Operations, Security & Governance); created additional silos. Interoperability Challenges: difficult to natively integrate existing applications. Commercial add-ons opportunistically emerged in the early days to address these shortcomings.
  • 11. Architected & led development of YARN to enable the Modern Data Architecture. Timeline labels: 2006, 2009, October 23, 2013. Hadoop w/ MapReduce: nodes 1 to N; HDFS (Hadoop Distributed File System); MapReduce (largely batch processing); siloed clusters; largely batch system; difficult to integrate. MR-279: YARN. Hadoop 2 & YARN based architecture: nodes 1 to N; HDFS (Hadoop Distributed File System); YARN: Data Operating System; batch, interactive and real-time.
  • 12. HDP2 and YARN enable the Modern Data Architecture. Hortonworks architected and led development of YARN. Common data set, multiple applications: optionally land all data in a single cluster; batch, interactive & real-time use cases; support multi-tenant access, processing & segmentation of data. YARN: Architectural center of Hadoop: consistent security, governance & operations; ecosystem applications certified by Hortonworks to run natively in Hadoop. Diagram labels: sources (Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured); applications (Business Analytics, Custom Applications, Packaged Applications); data system (RDBMS, EDW, MPP); YARN: Data Operating System over HDFS (Hadoop Distributed File System), nodes 1 to N; batch, interactive and real-time.
  • 13. A Blueprint for Enterprise Hadoop. GOVERNANCE & INTEGRATION: load data and manage according to policy. OPERATIONS: deploy and effectively manage the platform. DATA MANAGEMENT: store and process all of your corporate data assets. DATA ACCESS: access your data simultaneously in multiple ways (batch, interactive, real-time). SECURITY: provide a layered approach to security through Authentication, Authorization, Accounting, and Data Protection. PRESENTATION & APPLICATION: enable both existing and new applications to provide value to the organization. ENTERPRISE MGMT & SECURITY: empower existing operations and security tools to manage Hadoop. DEPLOYMENT OPTIONS: provide deployment choice across physical, virtual, cloud. YARN: Data Operating System.
  • 14. Hortonworks Data Platform 2.2: HDP Delivers Enterprise Hadoop. Diagram labels: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), Tez; Stream (Storm); Search (Solr); NoSQL (HBase, Accumulo); Slider; In-Memory (Spark); Others (ISV Engines). YARN: Data Operating System (Cluster Resource Management). HDFS (Hadoop Distributed File System), nodes 1 to N. GOVERNANCE: Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS. SECURITY: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox. OPERATIONS: Provision, Manage & Monitor: Ambari, ZooKeeper; Scheduling: Oozie. Deployment Choice: Linux, Windows, On-Premises, Cloud. YARN is the architectural center of HDP: common data set across all applications; batch, interactive & real-time workloads; multi-tenant access & processing. Provides comprehensive enterprise capabilities: governance, security, operations. Enables broad ecosystem adoption: ISVs can plug directly into Hadoop. The widest range of deployment options: Linux & Windows; on-premises & cloud.
  • 15. The Modern Data Architecture w/ HDP
  • 16. Hadoop Value: New Types of Data. Clickstream: capture and analyze website visitors’ data trails and optimize your website. Sensors: discover patterns in data streaming automatically from remote sensors and machines. Server Logs: research logs to diagnose process failures and prevent security breaches. Sentiment: understand how your customers feel about your brand and products, right now. Geographic: analyze location-based data to manage operations where they occur. Unstructured: understand patterns in files across millions of web pages, emails, and documents.
  • 17. New analytic applications for new types of data. Financial Services: New Account Risk Screens; Fraud Prevention; Trading Risk; Maximize Deposit Spread; Insurance Underwriting; Accelerate Loan Processing. Retail: 360° View of the Customer; Analyze Brand Sentiment; Localized, Personalized Promotions; Website Optimization; Optimal Store Layout. Telecom: Call Detail Records (CDRs); Infrastructure Investment; Next Product to Buy (NPTB); Real-time Bandwidth Allocation; New Product Development. Manufacturing: Supplier Consolidation; Supply Chain and Logistics; Assembly Line Quality Assurance; Proactive Maintenance; Crowdsourced Quality Assurance. Healthcare: Genomic data for medical trials; Monitor patient vitals; Reduce re-admittance rates; Store medical research data; Recruit cohorts for pharmaceutical trials. Utilities, Oil & Gas: Smart meter stream analysis; Slow oil well decline curves; Optimize lease bidding; Compliance reporting; Proactive equipment repair; Seismic image processing. Public Sector: Analyze public sentiment; Protect critical networks; Prevent fraud and waste; Crowdsource reporting for repairs to infrastructure; Fulfill open records requests.
  • 18. …to shift from reactive to proactive interactions. HDP and Hadoop allow organizations to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision): from static branding to Real-time Personalization; from break then fix to repair before break; from mass treatment to Designer Medicine; from Educated Investing to Automated Algorithms; from mass branding to 1x1 Targeting. Sector labels: a shift in Advertising, a shift in Financial Services, a shift in Healthcare, a shift in Retail, a shift in Telco.
  • 19. Data Lake: An architectural shift. Unlocking the Data Lake. Data Lake Enabled by YARN: single data repository, shared infrastructure; multiple biz apps accessing all the data; enable a shift from reactive to proactive interactions; gain new insight across the entire enterprise. Diagram labels: SCALE, SCOPE; RDBMS, MPP, EDW; New Analytic Apps or IT Optimization; HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management, YARN).
  • 20. HDP is deeply integrated in the data center. Deep Partnerships: Hortonworks engages in deep engineered relationships with the leaders in the data center, such as Microsoft, Teradata, Red Hat, HP, SAS & SAP. Broad Partnerships: over 600 partners work with us to certify their applications to work with Hadoop so they can extend big data to their users. Diagram labels: OPERATIONAL TOOLS; DEV & DATA TOOLS; INFRASTRUCTURE; sources (Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured); data system (RDBMS, EDW, MPP, HANA); applications (BusinessObjects BI); HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management, YARN).
  • 21. HDP Use Cases in Financial Services. Hortonworks. We do Hadoop.
  • 22. Monetize Anonymous & Aggregate Banking Data. Creating Opportunity. Banking: one of the largest US banks. Data: Structured, Clickstream, Social & Unstructured. Problem: valuable banking data needed to be anonymous & unified. The bank possesses data that indicates larger macro-economic trends, which can be monetized in secondary markets; regulations and company policies protect customer privacy; data sets are isolated in legacy silos controlled by LOBs; IT is challenged by joining data while guaranteeing anonymity. Solution: cross-bank data lake for aggregate data with secure access. Multiple data sets abstracted from source platforms; single point of security & privacy for de-identification, masking, encryption, authentication and access control; mortgage bankers, consumer bankers, credit card group and treasury bankers have access to the same cross-sell data; interoperability with partners SAS, R, Red Hat & Splunk; economies of scale for compressing & archiving data; significant reduction in storage costs from prior platforms.
  • 23. Insurance Data Lake to Manage Risk. Creating Opportunity. Health Insurance: large US medical insurer; >$30B in revenue; >20M members; ~35K employees. Data: Structured, Clickstream, Server Log. Problem: challenges merging new & old data hamper analysis. Traditional and newer types of data were both growing quickly but were difficult to combine in the EDW; “schema on load” requirements of the EDW platform limited ingest of some data with significant predictive power; the company missed data-driven ways to serve customers; the process of separating legitimate from fraudulent claims created a “needle-in-a-haystack” problem. Solution: common platform for all types of data improves up-sell and reduces fraud. “Schema on read” Hadoop architecture means that more data sources can be easily ingested to enrich predictive analytics; agents use big data insights to determine the best action for valued customers and recommend those in real-time; claims analysts and underwriters process streaming data to quickly flag fraud risks and fast-track legitimate claims.
  • 24. Maintaining SLAs for Equity Trading Information. Improving Efficiency. Investment Services: highly trafficked website providing business and financial information; ~15K employees. Data: Server Log & ETL. Problem: meeting 12 millisecond SLAs for “ticker plant”. Daily ingest: 50GB server log data from 10,000 feeds; four times daily, this data is pushed into DB2; applications query this data 35K times per second; 70% of queries are for data <1 year old, 30% for >1 year old; current architecture can only hold 10 years of trading data; growing volume puts performance at risk of missing SLAs. Solution: meeting SLAs with confidence. HBase provides super-fast queries within SLA targets; ETL offloading to Hadoop allows longer data retention, without jeopardizing fast response times.
  • 25. Hadoop is a Platform Decision. Open Leadership: drive innovation in the open via the Apache community-driven open source process. Enterprise Rigor: engineer, test and certify Apache Hadoop with the enterprise in mind. Ecosystem Endorsement: focus on deep integration with existing data center technologies and skills. Fastest Growing Customer and Partner Base: largest and most experienced Hadoop adopters have standardized on Hortonworks; the data center leaders have standardized on Hortonworks.
  • 27. WANdisco Background. WANdisco: Wide Area Network Distributed Computing. Enterprise-ready, high availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability. Leader in tools for software engineers: Subversion; Apache Software Foundation sponsor. Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND). US patent for active-active replication technology granted, November 2012. Global locations: San Ramon (CA), Chengdu (China), Tokyo (Japan), Boston (MA), Sheffield (UK), Belfast (UK).
  • 28. Customers
  • 29. Non-Stop Hadoop: Non-Intrusive Plugin to Hortonworks HDP. Provides Continuous Availability in the LAN / Across the WAN. Active/Active.
  • 30. 3 Problems For Sharing Data Across Clusters. LAN / WAN
  • 31. Enterprise-Ready Hadoop: Characteristics of Mission-critical Financial Applications. Require Continuous Availability: SLAs, regulatory compliance. Require HDFS to be Deployed Globally: share data between data centers; data is consistent, not eventual. Ease Administrative Burden: reduce operational complexity; simplify disaster recovery; lower RTO/RPO. Allow Maximum Utilization of Resources: within the data center; across data centers.
  • 32. Breaking Away from Active/Passive: What’s in a NameNode. Single Standby: inefficient utilization of resources (Journal Nodes, ZooKeeper Nodes, Standby Node); performance bottleneck; still tied to the beeper; limited to LAN scope.
  • 33. Breaking Away from Active/Passive: What’s in a NameNode. Single Standby: inefficient utilization of resources (Journal Nodes, ZooKeeper Nodes, Standby Node); performance bottleneck; still tied to the beeper; limited to LAN scope. Active / Active: all resources utilized (only NameNode configuration; scale as the cluster grows; all NameNodes active); load balancing; set resiliency (# of active NN); global consistency.
  • 34. Breaking Away from Active/Passive: What’s in a Data Center. Standby Data Center: idle resource (single data center ingest; disaster recovery only); one-way synchronization (DistCp); error prone (clusters can diverge over time); difficult to scale > 2 data centers (complexity of sharing data increases).
  • 35. Breaking Away from Active/Passive: What’s in a Data Center. Standby Data Center: idle resource (single data center ingest; disaster recovery only); one-way synchronization (DistCp); error prone (clusters can diverge over time); difficult to scale > 2 data centers (complexity of sharing data increases). Active / Active: DR resource available (ingest at all data centers; run jobs in both data centers); replication is multi-directional (active/active); absolute consistency (single HDFS spans locations); ‘N’ data center support (global HDFS allows appropriate data to be shared).
  • 37. Use Case: Disaster Recovery. Data is as current as possible (no periodic syncs). Doesn’t require monitoring and consistency checking. Virtually zero downtime to recover from regional data center failure. Regulatory compliance.
  • 38. Use Case: Multi-Data Center Ingest and multi-tenant workloads. Ingest and analyze anywhere. Analyze everywhere: fraud detection; equity trading information; new business; etc. Backup data center(s) can be used for work: no idle resources.
  • 39. Use Case: Heterogeneous Hardware. In-memory analytics. Mixed Hardware Profiles: memory, disk, CPU; isolate memory-hungry processing (Storm/Spark) from regular jobs. Share data, not processing: isolate lower priority (dev/test) work.
  • 40. The difficulty realizing the data lake…
  • 41. …is that data spans the entire world
  • 42. Data Reservoir Use Cases. Data Marts: restrict access to relevant data; create quick clusters. Feeder Sites (Data Tributaries): ingest only. Diagram labels: Data Ocean, Feeder Site, Accounting Mart, Banking Mart.
  • 43. Regulatory Compliance. Basel III: consistency of data. Data Privacy Directive: data sovereignty; data doesn’t leave country of origin. Diagram labels: Compliance, Regulation, Guidelines.
  • 45. Multi-Data Center Hadoop Today: What's wrong with the status quo. Periodic Synchronization: DistCp. Parallel Data Ingest: Load Balancer, Streaming.
  • 46. Multi-Data Center Hadoop Today: Hacks currently in use. Periodic Synchronization (DistCp): runs as MapReduce; DR data center is read-only; over time, Hadoop clusters become inconsistent; manual and labor-intensive process to reconcile differences; inefficient use of the network.
  • 47. Multi-Data Center Hadoop Today: Hacks currently in use. Parallel Data Ingest (Load Balancer, Flume): hiccups in either of the Hadoop clusters cause the two file systems to diverge; potential to run out of buffer when the WAN is down; requires constant attention and sys-admin hours to keep running; data created on the cluster is not replicated; streaming technologies (like Flume) used for data redirection handle only streaming data.
  • 48. Architecture of a Non-Stop Hadoop
  • 49. Q&A: Question and Answer. Submit your questions using the “ASK A QUESTION” button.
  • 50. Thank you