Hadoop Powers Modern Enterprise Data Architectures
Shaun Connolly, VP Strategy, Hortonworks (@shaunconnolly)
“By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent.”
– Gartner, Mark Beyer, “Information Management in the 21st Century”
Traditional Data Architecture (Pressured)

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEMS: Traditional repos (RDBMS, EDW, MPP)
DATA SOURCES: Traditional sources (RDBMS, OLTP, OLAP) and OLTP/POS systems
OPERATIONAL TOOLS: Manage & Monitor
DEV & DATA TOOLS: Build & Test

New sources (sentiment, clickstream, geo, sensor, …) now pressure this architecture.
Traditional Data Architecture, Pressured (Source: IDC)

New sources (sentiment, clickstream, geo, sensor, …):
- 2.8 ZB of data created and replicated in 2012
- 85% from new data types
- 15x machine data by 2020
- 40 ZB by 2020
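As a sanity check on those IDC figures, the implied compound annual growth rate between the 2012 and 2020 estimates can be computed directly (assuming steady year-over-year compounding between the two data points):

```python
# Implied compound annual growth rate between IDC's 2012 and 2020
# estimates, assuming steady year-over-year compounding.
zb_2012 = 2.8
zb_2020 = 40.0
years = 2020 - 2012

cagr = (zb_2020 / zb_2012) ** (1 / years) - 1
print(f"Implied growth: {cagr:.1%} per year")  # roughly 39% per year
```

Roughly 39% annual growth, which is consistent with the talk's framing of a 50-fold expansion of the digital universe over the decade.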
Modern Data Architecture Enabled

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEMS: Traditional repos (RDBMS, EDW, MPP) plus the ENTERPRISE HADOOP PLATFORM
DATA SOURCES: Traditional sources (RDBMS, OLTP, OLAP) and new sources (sentiment, clickstream, geo, sensor, …)
OPERATIONAL TOOLS: Manage & Monitor
DEV & DATA TOOLS: Build & Test
Agile “Data Lake” Solution Architecture

1. Capture all data
2. Process & structure
3. Distribute results
4. Feedback & retain

Feeding the Enterprise Hadoop Platform: business transactions & interactions (web, mobile, CRM, ERP, point of sale), logs & text data, sentiment data, structured DB data, clickstream data, geo & tracking data, and sensor & machine data, alongside classic data integration & ETL. Results flow out to business intelligence & analytics: dashboards, reports, visualization, …
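A defining property of the data lake is that raw data lands first and structure is applied at read time, not as a prerequisite for landing the data. A minimal sketch of that “schema on read” idea (the record layout and field names here are hypothetical, purely for illustration):

```python
# Sketch of "schema on read": raw events land untouched, and structure
# is projected only when a consumer asks for it. The field names here
# are hypothetical, purely for illustration.
import json

raw_events = [  # landed as-is, no upfront schema
    '{"ts": "2013-06-26T10:00:00", "type": "click", "page": "/home"}',
    '{"ts": "2013-06-26T10:00:05", "type": "sensor", "temp_c": 21.5}',
]

def read_with_schema(raw, wanted_type, fields):
    """Parse raw records and apply a schema at read time."""
    for line in raw:
        rec = json.loads(line)
        if rec.get("type") == wanted_type:
            yield {f: rec.get(f) for f in fields}

clicks = list(read_with_schema(raw_events, "click", ["ts", "page"]))
print(clicks)  # [{'ts': '2013-06-26T10:00:00', 'page': '/home'}]
```

The same raw events can later be re-read with a different projection (say, temperature readings from sensor records) without re-ingesting anything.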
Key Requirement of a “Data Lake”

Store ALL data in one place…
…and interact with that data in MULTIPLE ways: batch, interactive, streaming, graph, in-memory, HPC/MPI, online, other…
All on HDFS (redundant, reliable storage).
YARN Takes Hadoop Beyond Batch

Applications run natively “IN” Hadoop versus “ON” Hadoop, with predictable performance and quality of service:
- BATCH: MapReduce
- INTERACTIVE: Tez
- STREAMING: Storm
- GRAPH: Giraph
- IN-MEMORY: Spark
- HPC/MPI: OpenMPI
- ONLINE: HBase
- OTHER: e.g., Search

All on YARN (cluster resource management) over HDFS2 (redundant, reliable storage).
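The batch engine above is MapReduce. As a single-process sketch of its programming model (illustrative only; real jobs run distributed across the cluster, scheduled by YARN):

```python
# Single-process sketch of the MapReduce programming model (the BATCH
# engine above); real jobs run distributed, scheduled by YARN.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    for line in lines:                    # map: emit (word, 1) pairs
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    ordered = sorted(pairs, key=itemgetter(0))   # shuffle/sort by key
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))  # reduce: sum

counts = dict(reduce_phase(map_phase(["Hadoop powers data", "data powers insight"])))
print(counts)  # {'data': 2, 'hadoop': 1, 'insight': 1, 'powers': 2}
```

YARN's contribution is that this map/shuffle/reduce pattern is no longer the only option: the same cluster resources can also serve the interactive, streaming, graph, and in-memory engines listed above.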
Example: SQL-IN-Hadoop with Apache Hive

Stinger Initiative focus areas:
- Make Hive 100x faster
- Make Hive SQL compliant

Business analytics and custom apps issue SQL to Hive, which runs on MapReduce or Tez, over YARN and HDFS2.
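Hive compiles SQL queries into MapReduce (and, with the Stinger work, Tez) jobs. As a toy analogy for what that lowering looks like, here is a GROUP BY aggregate expressed as map and reduce steps in plain Python; this is an illustration of the idea, not Hive's actual query planner:

```python
# Toy analogy for SQL-in-Hadoop: a GROUP BY / SUM lowered into map and
# reduce steps, roughly the way Hive compiles a query into a job.
# This is an illustration, not Hive's actual query planner.
from collections import defaultdict

rows = [
    {"region": "us", "sales": 100},
    {"region": "eu", "sales": 80},
    {"region": "us", "sales": 50},
]

# SELECT region, SUM(sales) FROM rows GROUP BY region
mapped = ((r["region"], r["sales"]) for r in rows)  # map: key/value pairs
totals = defaultdict(int)
for region, sales in mapped:                        # reduce: sum per key
    totals[region] += sales

print(dict(totals))  # {'us': 150, 'eu': 80}
```

Stinger's speed goal comes largely from replacing the MapReduce half of this picture with Tez, which avoids materializing intermediate results between stages.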
Making Hadoop Enterprise Ready

Deployable on OS/VM, cloud, or appliance. The Enterprise Hadoop Platform layers:
- PLATFORM SERVICES: enterprise readiness (high availability, disaster recovery, security, and snapshots)
- OPERATIONAL SERVICES: manage & operate at scale
- DATA SERVICES: store, process, and access data
- CORE: distributed storage & processing
Mohit Saxena
VP & Technology Founder

Managing and Processing Data at Scale and Across Datacenters

Datacenters and services:
- UA2: ad servers, click servers, beacon servers, fraud service, global RTFB
- LHR1: ad servers, click servers, beacon servers
- UJ1: ad servers, click servers, beacon servers, billing service, download servers
- HKG1: ad servers, click servers, beacon servers

Data flows:
- UA2-Ruby: RAW logs
- UA2-Global: RAW logs
- LHR1-Emerald: RAW logs
- UJ1-Topaz: RAW logs
- HKG1-Opal: summaries
InMobi contributed Apache Falcon to address Hadoop data lifecycle management.
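Falcon manages data lifecycle concerns such as feed replication and retention declaratively. As a rough illustration of the kind of retention policy it automates (the paths, names, and 90-day window below are hypothetical, not Falcon's actual feed definition format):

```python
# Illustration of the kind of retention policy Apache Falcon automates
# for feeds; the names, paths, and 90-day window are hypothetical and
# not Falcon's actual configuration format.
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)  # assumed policy: keep raw logs 90 days

def partitions_to_evict(partitions, now):
    """Return dated partitions older than the retention window."""
    return [path for path, created in partitions.items()
            if now - created > RETENTION]

now = datetime(2013, 6, 26)
partitions = {
    "raw_logs/2013-01-01": datetime(2013, 1, 1),
    "raw_logs/2013-06-01": datetime(2013, 6, 1),
}
evicted = partitions_to_evict(partitions, now)
print(evicted)  # ['raw_logs/2013-01-01']
```

In Falcon, policies like this are declared once per feed and enforced across clusters, rather than scripted by hand at each datacenter.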
Many Communities Must Work As One

Open source, end users, and vendors must innovate, participate, and integrate.
Ecosystem Completes the Puzzle
- Data systems
- Applications, business tools, & dev tools
- Infrastructure & systems management

Thank You to Our Sponsors, spanning data systems; applications, business tools, & dev tools; and infrastructure & systems management.
Hadoop Wave ONE: Web-scale Batch Apps (2006 to 2012)
Source: Geoffrey Moore, Crossing the Chasm

The adoption curve plots relative % of customers over time: innovators & technology enthusiasts, early adopters & visionaries, THE CHASM, early majority & pragmatists, late majority & conservatives, and laggards & skeptics. Early-market customers want technology & performance; mainstream customers want solutions & convenience. Web-scale batch applications sit on the early side of the chasm.
Hadoop Wave TWO: Broad Enterprise Apps (2013 & Beyond)
Source: Geoffrey Moore, Crossing the Chasm

The same adoption curve, now with batch, interactive, online, streaming, and other workloads positioned to carry Hadoop across the chasm to the mainstream customers who want solutions & convenience.

Editor's Notes

  • #2: Thank you all for attending Hadoop Summit! For those who have attended previous Hadoop Summits: welcome back! For those new to Hadoop Summit: welcome to the Hadoop herd! I'd like to spend the next 30 minutes focused on Hadoop's opportunity to power modern enterprise data architectures. I've seen a lot of open source technologies and waves of IT change during my days at JBoss, Red Hat, SpringSource, and VMware, but I've not seen anything quite like this Hadoop wave. We're clearly at the forefront of a movement of something BIG, so savor the moment! Title: Hadoop Powers Modern Enterprise Data Architectures. Big data is everywhere and in many formats. We see it in commercials. We hear it in conversations over coffee. It is an expanding topic in the boardroom. At the center of the big data discussion is Apache Hadoop, which has evolved from a tool for web-scale early adopters to an enterprise data platform that addresses the needs of mainstream businesses. In this talk Shaun Connolly, VP Corporate Strategy for Hortonworks, will discuss how Hadoop has given rise to a next-generation enterprise data architecture that is uniquely capable of storing, refining, and deriving new business insights from ALL types of data in a way that complements existing enterprise systems and tools. Connolly will walk through how enterprises are utilizing Hadoop to refine and explore multi-structured information and enrich their applications with new insights. He will look at real-world use cases where Hadoop has helped produce more business value, augment productivity, or identify new and potentially lucrative opportunities. Over the coming years, Hadoop could be in a position to process more than half the world's data. While there is much work to be done to achieve this lofty goal, Connolly will highlight how the community and broader solution ecosystem have made great strides toward solidifying Hadoop's place within the enterprise.
  • #3: Gartner talks about how the IT landscape is being changed by the Nexus of Forces: namely Mobile, Social, Cloud, and Information (aka Big Data). Hadoop is clearly an Information Management technology, but if you think about it, Hadoop also has its massive legs in Mobile, Social, and Cloud. It's certainly a unique technology! To frame up my talk, I chose this quote from Mark Beyer of Gartner: "By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent." Whether it's opening up new business opportunities or outperforming your competitors by 20% or more, the important point is that big data technologies offer very real and compelling BUSINESS and FINANCIAL value to go along with the innovative TECHNOLOGY that is able to do things never before possible. What I ALSO like about this quote is that it's NOT a new quote: it was made about a year and a half ago, in late 2011!
  • #4: Let's set some context before digging into the Modern Data Architecture. While overly simplistic, this graphic represents the traditional data architecture: a set of data sources producing data; a set of data systems to capture and store that data, most typically a mix of RDBMS and data warehouses; and a set of custom and packaged applications as well as business analytics that leverage the data stored in those data systems. Your environment is undoubtedly more complicated, but conceptually it is likely similar. This architecture is tuned to handle TRANSACTIONS and data that fits into a relational database. [CLICK] Fast-forward to recent years, and this traditional architecture has become PRESSURED by new sources of data that aren't handled well by existing data systems. So in the world of big data, we've got classic TRANSACTIONS plus new sources of data that come from what I refer to as INTERACTIONS and OBSERVATIONS. INTERACTIONS come from such things as web logs, user click streams, social interactions & feeds, and user-generated content including video, audio, and images. OBSERVATIONS tend to come from the "Internet of Things": sensors for heat, motion, and pressure, and RFID and GPS chips within such things as mobile devices, ATM machines, automobiles, and even farm tractors are just some of the "things" that output observation data.
  • #5: So let's consider those NEW SOURCES of data and get a sense of the scope involved by considering some stats from IDC. [CLICK] According to IDC, 2.8 ZB of data were created and replicated in 2012. A zettabyte, for those unfamiliar with the term, is 1 billion terabytes. [CLICK] 85% of that is from new sources of data. [CLICK] Out of that 85%, machine-generated data is a key driver of the growth; that one new source alone is expected to grow 15x by 2020. [CLICK] Fast-forward to 2020 and we'll have 40 zettabytes of data in the digital universe! This represents 50-fold growth from the beginning of 2010. [CLICK] Needless to say, wrestling that scale of data is like this poor guy trying to wrestle a champion sumo athlete: overwhelmed and outmatched, to say the least. I've been using this graphic for the past 10 years or so. Given the world of big data we live in, I just had to trot this picture out once more. It just says it all, doesn't it?
  • #6: As the volume of data has exploded, we've seen organizations acknowledge that not all data belongs in a traditional data system. The drivers are both cost and technology. As volumes grow, database licensing costs as well as the corresponding hardware costs can become prohibitive. And traditional databases are not ideal for handling very large datasets of varying data types. People want to store data quickly in its RAW format and apply structure and a schema later, after it's been processed a bit more. Enter Enterprise Hadoop as a peer to traditional data systems. The momentum for Hadoop is NOT about replacing traditional databases. Rather, it's about adding Hadoop to handle this big data problem in a way that integrates easily with existing data systems, tools, and approaches. This means it must interoperate with: existing applications and BI tools; existing databases and data warehouses, for loading data to and from the data warehouse; development tools used for building custom applications; and operational tools for managing and monitoring. Mainstream enterprises want to get the benefits of new technologies in ways that leverage existing skills and integrate with existing systems.
  • #7: In order to illustrate how Hadoop fits within the broader enterprise data architecture, I prefer to use a data flow diagram rather than the classic stack diagram we just covered.We are seeing may customers that want to deploy what we’ve been referring to as a “Data Lake” Solution Architecture that puts them in a position to maximize the value from ALL of their data: transactions + interactions + observations.At the highest level, we have three major areas of data processing, the first two of which are familiar to most enterprises:1. Business Transactions & Interactions2. Business Intelligence & AnalyticsEnterprise IT has been connecting systems via classic Data Integration and ETL processing, as illustrated in Step 1 above, for many years in order to deliver STRUCTURED and REPEATABLE analysis. In this step, the business determines the questions to ask and IT collects and structures the data needed to answer those questions.[CLICK] As we’ve discussed, New Data Sources representing Interactions and Observations have come onto the scene. And Enterprise Hadoop has appeared as a new system capable of capturing ALL of this multi-structured data into one place. Hadoop acts as a “Data Lake” if you will. Some call it a Data Reservoir, a Catch Basin, a Data Refinery, the foundation for a Data Hub & Spoke architecture. Regardless of name, it’s a place where ALL data can be brought together where it can then be flexibly aggregated and transformed into useful formats that help fuel new insights for the business. Structure and schema is applied when needed, NOT as a prerequisite before landing the data. [CLICK] The next step is about getting the data in the right format to those who need it. Some folks will cordon off ponds of data, to keep with our metaphor, for data scientists, researchers, or particular departments to interact with specific data of interest. 
Tools like Hive and HBase are commonly used for interacting with Hadoop data directly. Mainstream enterprises also benefit from integrating Enterprise Hadoop with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics, opening up the ability to get a richer and more informed 360° view of customers, for example. By directly integrating Enterprise Hadoop with Business Intelligence & Analytics solutions, companies can more accurately understand the customer behaviors (aka Interactions) that lead to or inhibit their Transactions. Moreover, systems focused on Business Transactions & Interactions can benefit: complex analytic models and calculations of key parameters can be performed in Hadoop and flow downstream to fuel online data systems powering business applications, with the goal of more accurately targeting customers with the best and most relevant offers. [CLICK] Since Hadoop is great at cost-effectively retaining large volumes of data for long periods of time, feedback loops enable a valuable closed-loop analytics system. Retaining the past 10 years of historical “Black Friday” retail data, for example, can benefit the business, especially if it’s blended with other data sources such as 10 years of weather data accessed from a third-party data provider. The point here is that the opportunities for creating value from multi-structured data sources available inside and outside the enterprise are virtually endless if you have a platform that can do it cost-effectively and at scale. A couple of final points before I move on:
1. Capturing all data in Hadoop does not mean that your existing transaction and analytics applications need to be forklifted to run on top of Hadoop. The point is that you can ALSO store in Hadoop the data that lives in those systems.
Yes, the data gets stored twice, but the flexibility and agility far exceed the incremental expense, especially given the commodity nature of the hardware that Hadoop uses.
2. One final point on the Data Lake: the goal isn’t to fill up Lake Superior right away. Most companies start with a small lake of data needed for targeted applications and, over time, direct more and more streams of data into the lake. Let success beget more success.
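To make the closed-loop idea concrete, here is a minimal Python sketch of blending retained sales history with third-party weather observations by date and store. All records, field names, and the `blend` helper are invented for illustration.

```python
# Hypothetical sketch: joining retained historical sales data with a
# third-party weather feed, the kind of blending a "Data Lake" enables.
sales = [
    {"date": "2012-11-23", "store": "NYC", "revenue": 120000},
    {"date": "2012-11-23", "store": "CHI", "revenue": 95000},
]
weather = {
    ("2012-11-23", "NYC"): "snow",
    ("2012-11-23", "CHI"): "clear",
}

def blend(sales_records, weather_by_key):
    """Attach the weather observed at each (date, store) to the sales record."""
    return [
        dict(rec, weather=weather_by_key.get((rec["date"], rec["store"]), "unknown"))
        for rec in sales_records
    ]

blended = blend(sales, weather)
```

At Hadoop scale the same join would run as a distributed job over years of retained data, but the shape of the analysis is the same.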
  • #8: So as mainstream enterprises begin to store ALL of their data in one place, there’s a clear and growing desire to work with that data not only through classic, batch-oriented MapReduce, but through a much wider range of interaction patterns. [CLICK] Interactive SQL solutions running on or next to Hadoop have gotten lots of press over recent months. Online data systems that store their data in HDFS are on the rise, as are Streaming and Complex Event Processing solutions and Graph Processing. In-Memory Data Processing is another area. Even classic HPC Message Passing Interface apps are storing data in HDFS. The point is that as enterprises store all data in one place, they increasingly need to interact with that data in a wide variety of ways.
  • #9: We are facing an exciting generational change in the Hadoop space. The first wave of Hadoop was about HDFS and MapReduce, where MapReduce had a split brain, so to speak: it was a framework for massive distributed data processing, but it also had all of the Job Management and Task Tracking capabilities built into it. The second wave of Hadoop is upon us, and a component called YARN has emerged that generalizes all of that Cluster Resource Management so that MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”. [CLICK] YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Because businesses want the ability to run more applications on their Hadoop data, and to do so with predictable performance and quality of service. Mixed workload management enables customers to protect against one application or user hogging cluster resources and starving the other applications running in the Hadoop cluster. [CLICK] Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They’re adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways. This second wave of Hadoop represents a major re-architecture that has been under way for three or four years. And this slide shows just a sampling of other open source projects that are, or soon will be, leveraging YARN. Apache Tez is a new framework that I’ll cover in a bit. Folks at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN-enabled.
Spark is an in-memory data processing system built at Berkeley that was recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface implementation for HPC that works on YARN. These are just a few examples.
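For readers newer to the first wave, the MapReduce paradigm that YARN now hosts as just one framework among many can be sketched in plain Python. This is a toy in-memory version of the map, shuffle, and reduce phases, not the Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs, as a Hadoop mapper would."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key -- the sort/shuffle step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word, as a Hadoop reducer would."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
```

In real Hadoop the map and reduce functions run in parallel across the cluster, with YARN allocating the containers they run in; the logical flow is the same.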
  • #10: As I just mentioned, SQL for Hadoop has been a hot topic for the past six months or so. And rightly so. There are easily millions of people with SQL skills who would like to leverage those skills as they look to gain insight and value from data stored in Hadoop. With that as backdrop, at the beginning of the year, the Stinger Initiative was rolled out. Its focus was to rally the Apache Hive community around two goals: making Hive 100X faster, so it can handle interactive querying use cases, and making Hive more SQL-compliant, so it supports richer BI use cases. Oh, and by the way, this work needs to happen in a way that PRESERVES Hive’s awesome capability of processing ginormous data sets. Eric14 will cover the details of where the Stinger effort stands; it’s made awesome progress. What I wanted to highlight here is that as part of the Stinger Initiative, a new data processing framework has appeared to handle interactive querying use cases for Hive. This project is called Apache Tez, and it helps eliminate needless HDFS writes that have traditionally slowed down Hive. Instead of a complex chain of MapReduce steps, Tez enables a Map-Reduce-Reduce paradigm that is much faster. The net-out is that interactive SQL querying use cases can now run natively IN Hadoop, since Tez is built on YARN. This helps ensure that interactive queries and classic MapReduce processing can coexist nicely within the same cluster with predictable performance and SLAs.
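The performance idea behind Tez, namely chaining stages without materializing intermediate results to HDFS, can be loosely illustrated with Python generators, where each stage streams into the next instead of writing its output out. This is a rough analogy, not how Tez is implemented; the stages and row data are invented.

```python
def scan(rows):
    """First stage: stream rows, like a task reading input splits."""
    for row in rows:
        yield row

def filter_stage(rows, predicate):
    """Intermediate stage feeding the next one directly -- no temp files."""
    for row in rows:
        if predicate(row):
            yield row

def aggregate(rows):
    """Final reduce-style stage consuming the streamed pipeline."""
    return sum(row["amount"] for row in rows)

rows = [{"region": "US", "amount": 10}, {"region": "EU", "amount": 7},
        {"region": "US", "amount": 5}]

# The three stages form one pipeline; nothing is written out between them,
# the way a chain of separate MapReduce jobs would write to HDFS.
total = aggregate(filter_stage(scan(rows), lambda r: r["region"] == "US"))
```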
  • #11: So enterprise Hadoop lies at the heart of the next-generation data architecture. Let’s outline what’s required in and around Hadoop in order to make it easy for the enterprise to use and consume. At the center, we start with Apache Hadoop for distributed file storage and data processing (a la HDFS, MapReduce, and YARN). [CLICK] In order to enable Hadoop within mainstream enterprises, we need to address enterprise concerns such as high availability, disaster recovery, snapshots, security, etc. The community has been hard at work in both the 1.0 and 2.0 lines of Hadoop addressing these needs. There are also new incubator projects, such as Apache Knox (which Eric will cover later), for improving user access to Hadoop clusters. [CLICK] On top of this, we need to provide data services that make it easy to move data in and out of the platform, process and transform the data into useful formats, and enable people and other systems to access the data easily. This is where components like Apache Hive for SQL access, HCatalog for describing and managing your tables within Hadoop, Pig for script-based data processing, HBase for online data serving, and Sqoop and Flume for getting data into Hadoop come in. [CLICK] It’s also important, I would argue equally important, to make the platform easy to operate. Components like Apache Ambari for provisioning, management, and monitoring of the cluster, Oozie for job & workflow scheduling, and a new framework called Apache Falcon for Data Lifecycle Management fit here. [CLICK] All of that (Core and Platform Services, Data Services, and Operational Services) comes together into what I think of as “Enterprise Hadoop”. [CLICK] Ensuring that Enterprise Hadoop can be flexibly deployed across operating systems and virtual environments like Linux, Windows, and VMware is important. Targeting cloud environments like Amazon Web Services, Microsoft Azure, Rackspace OpenCloud, and OpenStack is increasingly important.
So is the ability to provide Enterprise Hadoop pre-configured within a hardware appliance like Teradata’s Big Analytics Appliance, which helps enterprises deploy Hadoop quickly, easily, and in a familiar way.
  • #12: With that as backdrop, I’d like to talk about the need for better Data Lifecycle Management capabilities in Hadoop clusters. And to do so, I’d like to welcome Mohit Saxena, the VP and Technology Founder of InMobi, to the stage. For those unfamiliar with InMobi, they are a company focused on mobile advertising that was recently voted one of 50 disruptive companies by MIT Technology Review. InMobi has been using Hadoop for many years, and their technologists have been very active code contributors in the Apache Hadoop community. I’ve asked Mohit to join us today to share a little bit about how and why InMobi uses Hadoop, and some thoughts on how his team handles the challenge of managing data at scale and across datacenters.
[SHAUN shakes Mohit’s hand and CLICKS to next slide]
  • #13: [SHAUN] Mohit, we’ve got a high-level diagram of your data processing architecture. Why don’t you set some context for InMobi by sharing some of the impressive business metrics and Hadoop cluster metrics behind this picture?
[MOHIT]
~1.5 trillion ads requested per year
20 billion messages streamed per year
2 billion monetization events
6 clusters ranging from 40 to 250 nodes each
20 million Hadoop jobs submitted by users
2 billion MapReduce slots used in Hadoop
[SHAUN] Pretty impressive solution architecture! One of the common questions I get from enterprise customers is how to deal with Data Lifecycle Management in Hadoop environments. You and your team addressed those needs by creating a framework that you ultimately contributed to the Apache Software Foundation as Apache Falcon.
[TRANSITION TO NEXT SLIDE]
  • #14: [SHAUN] Please share the story behind Falcon for the audience.
[MOHIT] Discuss the problems you were looking to address with the technology that ultimately became Falcon: specifically, how to handle such things as orchestrating data ingest and data processing pipelines, disaster recovery, data retention scenarios, etc. Also share why you decided to contribute the project to Apache.
[SHAUN] Everybody, please join me in thanking Mohit for joining us today and sharing his story. It’s amazing to see how companies like InMobi can help accelerate the process of making Hadoop a more enterprise-viable data platform.
  • #15: I’ve been in enterprise open source for almost a decade. One thing I’ve learned along the way is that it’s best to think of “Community” in a broad way. In the Hadoop space, there is clearly the open source community; without the innovative Apache open source technology, none of us would be here today. For really impactful and industry-changing open source technologies, there’s also the end user community. This community spans the tech-savvy early adopter types as well as the more pragmatic and conservative adopter types who want a more “whole solution”. The third piece is the broader ecosystem that integrates with, extends, enhances, and builds on the core technology. One of the reasons I asked Mohit from InMobi to come on stage and share his story is that InMobi is a great example of an end user who is VERY ACTIVE in the open source community. This room is filled with people across these three areas, and each of these perspectives is CRITICALLY IMPORTANT if Hadoop is to be all it can be. So my simple ask of you is: GET INVOLVED… in whatever way makes sense for you and your business.
  • #16: The ecosystem plays a critical role in rounding out solution architectures around Apache Hadoop. This slide outlines three major layers of the data stack and conveniently lists the Hadoop Summit platinum sponsors. Starting from the bottom, we have Infrastructure and Systems Management. Above that, we have Data Management Systems, Data Movement, and Integration solutions. At the top, we have Development Tools, Business Tools, and Applications that ride on top. I’d like to thank Cisco, Microsoft, Kognitio, IBM, Teradata, Datameer, Karmasphere, Platfora, SAS, and Splunk for being platinum sponsors! I also want to thank Yahoo for co-hosting this event with Hortonworks!
  • #17: Now let’s expand the scope to include ALL of the sponsors! I love this slide because it is very BUSY! The cool thing is that we have almost 70 sponsors providing really nice coverage across all layers of the data stack, a great sign that the Hadoop market is maturing quite nicely!
  • #18: So I’d like to end my session with a quick summary of where the Hadoop market stands today. Hadoop Wave ONE started in 2006 and did a GREAT job at Web-scale, batch-oriented data processing. A vibrant community and strong enterprise interest propelled Hadoop across the Chasm at the end of 2012.
  • #19: The second wave of Hadoop has started, and it will continue to fuel Hadoop on its path through mainstream adoption. Everyone in this room is at the forefront of a movement that will have lasting impact across the industry. As Rob mentioned in his opening remarks, Hadoop has the opportunity to process half the world’s data. There’s still a lot of work to be done. My simple ask of you is: GET INVOLVED… in whatever way makes sense for you and your business. Thank you and have a great conference!