Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Real-Time Processing in Hadoop
Hadoop Summit 2015
Ali Bajwa
Partner Solutions Engineer
June 2015
© Hortonworks Inc. 2012
Professional Services
Agenda
• Introduction & about Hortonworks HDP
• Overview of logistics industry scenario
• Overview of streaming architecture on HDP
• Streaming Demo #1
• Integrating Predictive Analytics in streaming scenarios
• Streaming Demo with Predictive additions
• Q & A
Preface: Enabling Technologies
• Problems solved at scale, via fundamentally new approaches…
• Make it possible, even simple, to produce new products and applications that would
have been cost-prohibitive, or simply impossible, beforehand.
• Just as foundation technologies like Li-ion batteries, Retina displays, GPS & tiny HD cameras
(from smartphones) have enabled electric cars, quadcopters, VR displays & more…
• Hadoop has similarly led to breakthroughs in big data scale & capability, and enables
new real-time advanced analytic applications.
Why did Hadoop emerge?
April 2015
Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to Scale
[Chart: business value of new versus traditional data. New data: clickstream, geolocation, web data, Internet of Things, docs and emails, server logs. Traditional data: ERP, CRM, SCM. Data volume: 2.8 zettabytes in 2012, 40 zettabytes in 2020. Chart labels: laggards, industry leaders.]
Hadoop for the Enterprise: Implement a
Modern Data Architecture with HDP
Spring 2015
Hortonworks. We do Hadoop.
Hadoop for the Enterprise:
Implement a Modern Data Architecture with HDP
Customer Momentum
• 330+ customers (as of year-end 2014)
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for
resource management, security, operations, and governance.
Partner for Customer Success
• Open source community leadership focused on enterprise needs
• Unrivaled world class support
• Founded in 2011
• Original 24 architects, developers,
operators of Hadoop from Yahoo!
• 600+ Employees
• 1000+ Ecosystem Partners
Customer Partnerships matter
Driving our innovation through
Apache Software Foundation Projects
Apache Project   Committers   PMC Members
Hadoop           27           21
Pig              5            5
Hive             18           6
Tez              16           15
HBase            6            4
Phoenix          4            4
Accumulo         2            2
Storm            3            2
Slider           11           11
Falcon           5            3
Flume            1            1
Sqoop            1            1
Ambari           34           27
Oozie            3            2
Zookeeper        2            1
Knox             13           3
Ranger           10           n/a
TOTAL            161          108
Source: Apache Software Foundation. As of 11/7/2014.
Hortonworkers are the architects and
engineers that lead development of open
source Apache Hadoop at the ASF
• Expertise
Uniquely capable of solving the most complex issues &
ensuring success with the latest features
• Connection
Provide customers & partners direct input into
the community roadmap
• Partnership
We partner with customers through a subscription offering.
Our success is predicated on yours.
[Chart: Hadoop committers by employer. Hortonworks: 27; Cloudera: 11; Yahoo: 10; Facebook: 5; LinkedIn: 2; IBM: 2; Others: 23.]
Technology Partnerships matter
[Table: Hortonworks relationship by partner. Columns: Named Partner, Certified Solution, Resells, Joint Engineering. Partners: Microsoft, HP, SAS, SAP, IBM, Pivotal, Red Hat, Teradata, Informatica, Oracle. The per-partner check marks are not legible in this copy.]
It is not just about
packaging and certifying
software…
Our joint engineering
with our partners drives
open source standards
for Apache Hadoop
HDP is
Apache Hadoop
HDP delivers a Centralized Architecture
Modern Data Architecture
• Unifies data and processing.
• Enables applications to have access to
all your enterprise data through an
efficient centralized platform
• Supported with a centralized approach to
governance, security and operations
• Versatile enough to handle any application
and dataset, no matter the size or type
[Diagram: Modern Data Architecture. Sources: existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured data. Platform: YARN: Data Operating System over HDFS (Hadoop Distributed File System), running batch, interactive, real-time and partner ISV workloads, alongside EDW and MPP systems. Analytics: data marts, applications, business analytics, visualization & dashboards.]
HDP delivers a completely open data platform
Hortonworks Data Platform 2.2
Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core
enterprise services, for any application and any data.
Completely Open
• HDP incorporates every element
required of an enterprise data
platform: data storage, data access,
governance, security, operations
• All components are developed in
open source and then rigorously
tested, certified, and delivered as an
integrated open source platform that
the enterprise and ecosystem can
easily consume and use.
[Diagram: HDP 2.2 components, running on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).
Governance: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka.
Batch, interactive & real-time data access: Apache Pig, Apache Hive, Cascading, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm.
Security: Apache Ranger, Apache Knox, Apache Falcon.
Operations: Apache Ambari, Apache Zookeeper, Apache Oozie.]
Real World Use Case:
Trucking Company
Spring 2015
Hortonworks. We do Hadoop.
Scenario Overview
Trucking company w/ large fleet of trucks in Midwest
A truck generates millions of events for a
given route; an event could be:
• ‘Normal’ events: starting / stopping of the
vehicle
• ‘Violation’ events: speeding, excessive
acceleration and braking, unsafe tail distance
Company uses an application that monitors
truck locations and violations from the
truck/driver in real-time
Route?
Truck?
Driver?
Analysts query a broad
history to understand if
today’s violations are
part of a larger problem
with specific routes,
trucks, or drivers
Trucking Company’s YARN-enabled Architecture
[Diagram: truck sensors send events to inbound messaging (Kafka), which feeds stream processing (Storm). The other components shown are real-time serving (HBase), alerts & events (ActiveMQ), a real-time user interface, and SQL interactive query (Hive on Tez). Everything runs on YARN (many workloads) over HDFS (distributed storage): one cluster with consistent security, governance & operations.]
What is Kafka?
• High-throughput distributed messaging system
• Publish-subscribe semantics, but reimagined at the implementation level to
operate at speed with big data volumes
• Kafka @LinkedIn:
– 800 billion messages per day
– 175 terabytes of data written per day
– 650 terabytes of data read per day
– Over 13 million messages/2.75 GB of data per second
[Diagram: three producers publish to a Kafka cluster; three consumers read from it.]
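As a rough sanity check on the LinkedIn figures quoted on this slide, the daily totals can be turned into per-second averages. This is only arithmetic on the quoted numbers; it assumes decimal terabytes and gigabytes, which the slide does not specify.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 800e9          # from the slide
bytes_written_per_day = 175e12    # 175 TB, assuming decimal terabytes
bytes_read_per_day = 650e12       # 650 TB, assuming decimal terabytes

avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY
avg_write_bytes_per_sec = bytes_written_per_day / SECONDS_PER_DAY
read_to_write_ratio = bytes_read_per_day / bytes_written_per_day

print(f"average messages/sec: {avg_messages_per_sec:,.0f}")
print(f"average write rate:   {avg_write_bytes_per_sec / 1e9:.2f} GB/s")
print(f"read:write ratio:     {read_to_write_ratio:.1f}")
```

Both averages come out below the quoted 13 million messages and 2.75 GB per second, which is consistent with that figure being a peak rather than a mean. The read volume is several times the write volume, as one would expect when several consumers read each message.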
Kafka: Anatomy of a Topic
[Diagram: a topic with three partitions (Partition 0, Partition 1, Partition 2). Each partition is an ordered sequence of messages numbered from offset 0 (old) upward (new). The partitions hold different numbers of messages, and writes are appended at the new end.]
• Partitioning allows topics to
scale beyond a single
machine/node
• Topics can also be replicated,
for high availability.
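The partition anatomy above can be modeled in a few lines of plain Python. This is a toy sketch, not Kafka client code: the `Topic` class, the CRC32-based partitioner and the `truck_events` name are all assumptions made for illustration, and replication is left out.

```python
import zlib


class Topic:
    """Toy model of a partitioned topic: each partition is an append-only list."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Hypothetical partitioner: stable hash of the key modulo partition count.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append to the new end of one partition; return (partition, offset)."""
        p = self.partition_for(key)
        log = self.partitions[p]
        log.append((key, value))
        return p, len(log) - 1

    def read(self, partition, offset):
        """Return every message in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


topic = Topic("truck_events")
placements = [topic.append(f"truck-{i % 4}", {"seq": i}) for i in range(12)]
```

Because the partition is chosen from the key, all events for one truck land in the same partition and keep their relative order, while different trucks can be spread over different machines.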
Apache Storm
• Distributed, real-time, fault-tolerant stream processing platform.
• Provides processing guarantees.
• Key concepts include:
•Tuples
•Streams
•Spouts
•Bolts
•Topology
Tuples and Streams
• What is a Tuple?
–The fundamental data structure in Storm: a named list of values, each of which can be of any data type.
• What is a Stream?
–An unbounded sequence of tuples.
–Streams are the core abstraction in Storm; they are what you “process”.
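To make the two definitions concrete, here is a minimal Python analogy rather than Storm code: a named tuple stands in for a Storm tuple and a generator for an unbounded stream. The field names and event values are hypothetical.

```python
from collections import namedtuple
from itertools import count, islice

# Hypothetical field names; a Storm tuple is a named list of values.
TruckEvent = namedtuple("TruckEvent", ["driver_id", "truck_id", "event_type", "speed"])


def truck_event_stream():
    """An unbounded stream: yields tuples forever, one at a time."""
    kinds = ["normal", "speeding", "normal", "unsafe_tail_distance"]
    for i in count():
        yield TruckEvent(
            driver_id=10 + i % 3,
            truck_id=100 + i % 2,
            event_type=kinds[i % len(kinds)],
            speed=55 + (i % 4) * 10,
        )


# A consumer only ever sees a finite window of an unbounded stream.
first_five = list(islice(truck_event_stream(), 5))
```

The generator never terminates on its own, which is the point: processing logic has to work tuple by tuple instead of waiting for the data set to be complete.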
Spouts
• What is a Spout?
–Generates streams, or acts as a source of them
–E.g.: JMS, Twitter, Log, Kafka Spout
–Multiple instances of a Spout can be spun up and adjusted dynamically as needed
Bolts
• What is a Bolt?
–Processes any number of input streams and produces output streams
–Common processing in bolts includes functions, aggregations, joins, reads and writes to data stores, and alerting
logic
–Multiple instances of a Bolt can be spun up and adjusted dynamically as needed
• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in HBase
2. HDFSBolt: persisting into HDFS as Avro files using Flume
3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the
number of illegal driver incidents exceeds a given threshold.
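The alerting rule described for the MonitoringBolt can be sketched as follows. This is a simplified stand-in, not the demo's code: the class name, the event dictionary layout and the default threshold are assumptions, an in-memory counter replaces HBase, and a list replaces email and ActiveMQ.

```python
from collections import Counter


class MonitoringBoltSketch:
    """Counts violation events per driver and raises one alert past a threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = Counter()   # stands in for the counts kept in HBase
        self.alerts = []          # stands in for email / ActiveMQ messages

    def execute(self, event):
        if event["event_type"] == "normal":
            return
        driver = event["driver_id"]
        self.counts[driver] += 1
        # "Exceeds" the threshold: alert once, on the first event beyond it.
        if self.counts[driver] == self.threshold + 1:
            self.alerts.append(f"driver {driver} exceeded {self.threshold} violations")


bolt = MonitoringBoltSketch(threshold=2)
for kind in ["speeding", "normal", "speeding", "speeding", "speeding"]:
    bolt.execute({"driver_id": 7, "event_type": kind})
```

Alerting only on the first event past the threshold keeps a persistently bad driver from flooding the alert channel; a real deployment would also need to decide when counts reset.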
Topology
• What is a Topology?
–A network of spouts and bolts wired together into a workflow
[Diagram: Truck-Event-Processor Topology. Components: Kafka Spout, HBase Bolt, Monitoring Bolt, HDFS Bolt, WebSocket Bolt, connected by streams.]
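The idea of wiring spouts and bolts into a workflow can be illustrated with a small directed graph in Python. This is a conceptual sketch, not the Storm API, and the wiring below is an assumption: the diagram's exact connections are not recoverable from this copy, so the example simply fans a spout out to two bolts and chains a third.

```python
from collections import defaultdict


class TopologySketch:
    """A directed graph: whatever a component emits goes to its subscribers."""

    def __init__(self):
        self.handlers = {}                     # name -> function(item) -> list of items
        self.subscribers = defaultdict(list)   # name -> downstream component names

    def add(self, name, handler, sources=()):
        self.handlers[name] = handler
        for source in sources:
            self.subscribers[source].append(name)

    def emit(self, source, item):
        for name in self.subscribers[source]:
            for out in self.handlers[name](item):
                self.emit(name, out)


stored, archived, pushed = [], [], []


def hbase_bolt(event):
    stored.append(event)
    return [event]        # pass the event on to downstream bolts


def hdfs_bolt(event):
    archived.append(event)
    return []


def websocket_bolt(event):
    pushed.append(event)
    return []


topology = TopologySketch()
topology.add("hbase_bolt", hbase_bolt, sources=["kafka_spout"])
topology.add("hdfs_bolt", hdfs_bolt, sources=["kafka_spout"])
topology.add("websocket_bolt", websocket_bolt, sources=["hbase_bolt"])

for event in ({"truck_id": 1}, {"truck_id": 2}):
    topology.emit("kafka_spout", event)
```

Each bolt sees only the streams it subscribes to, so adding a new consumer of the same events means adding a node to the graph rather than changing the existing bolts.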
Key Constructs in Apache HBase
• HBase = Key / Value store
• Designed for petabyte scale
• Supports low latency reads, writes and updates
• Key features
– Updateable records
– Versioned Records
– Distributed across a cluster of machines
– Low Latency
– Caching
• Popular use cases:
– User profiles and session state
– Object store
– Sensor apps
Data Assignment
[Diagram: an HBase table. Keys within HBase are divided among different RegionServers.]
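One way to picture this assignment: the sorted key space is cut into contiguous ranges, and each range is served by one server. The sketch below is an illustration under assumptions, with made-up split points and server names, and it ignores how real clusters split and rebalance regions.

```python
from bisect import bisect_right

# Hypothetical split points and server names. Each region covers a contiguous,
# sorted range of row keys: [start key, next region's start key).
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]


def region_for(row_key):
    """Index of the region whose key range contains row_key."""
    return bisect_right(REGION_START_KEYS, row_key) - 1


def server_for(row_key):
    """Name of the server assumed to host the region for row_key."""
    return REGION_SERVERS[region_for(row_key)]
```

For example, `server_for("driver-11")` falls in the first range and `server_for("route-3")` in the third. Because ranges are contiguous, a scan over neighboring keys usually touches one or two servers, while sequential keys can concentrate writes on a single region.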
Data Access
• Get
–Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with
a matching rowkey
• Put
–Inserts a new version of a cell.
• Scan
–The whole table, row by row, or a section of that table starting at a particular start key and
ending at a particular end key
• Delete
–Actually a form of put: it adds a new version that carries a deletion marker
• SQL via Apache Phoenix
–Unique capability in the NoSQL market
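The four operations can be mimicked with a small in-memory model to show how versions and deletion markers interact. This is a sketch of the semantics only, not the HBase client API; the `TableSketch` class, the counter used in place of timestamps, and the row and column names are assumptions.

```python
from itertools import count

TOMBSTONE = object()   # deletion marker, written like any other version


class TableSketch:
    def __init__(self):
        self.cells = {}         # (row, column) -> list of (version, value), oldest first
        self.clock = count(1)   # hypothetical stand-in for timestamps

    def put(self, row, column, value):
        """Insert a new version of a cell; older versions are kept."""
        self.cells.setdefault((row, column), []).append((next(self.clock), value))

    def delete(self, row, column):
        """A delete is a put of a marker; nothing is removed in place."""
        self.put(row, column, TOMBSTONE)

    def get(self, row, column):
        """Return the newest value of a cell, or None if absent or deleted."""
        versions = self.cells.get((row, column))
        if not versions or versions[-1][1] is TOMBSTONE:
            return None
        return versions[-1][1]

    def scan(self, start_row, stop_row):
        """Yield (row, column, value) in key order for start_row <= row < stop_row."""
        for row, column in sorted(self.cells):
            if start_row <= row < stop_row:
                value = self.get(row, column)
                if value is not None:
                    yield row, column, value


table = TableSketch()
table.put("driver-11", "events:count", 1)
table.put("driver-11", "events:count", 2)
table.put("driver-12", "events:count", 5)
table.put("driver-20", "events:count", 9)
table.delete("driver-12", "events:count")
```

After these calls, a get on `driver-12` returns nothing even though both its original value and the marker are still stored, which is why deletes are cheap at write time and cleanup happens later.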
[Diagram: timeline with markers at 2006, 2009 and October 23, 2013, labeled Hadoop w/ MapReduce, MR-279: YARN, and Hadoop 2 & YARN. Left: Hadoop with MapReduce over HDFS (Hadoop Distributed File System) on nodes 1 to N: largely batch processing, silo’d clusters, difficult to integrate. Right: Hadoop 2 & YARN based architecture, with YARN: Data Operating System over HDFS on nodes 1 to N, running batch, interactive and real-time workloads. Caption: architected & led development of YARN to enable the Modern Data Architecture.]
Benefits of YARN as the Data Operating System
• The container based model allows for running nearly any workload.
–Enables the centralized architecture.
–No longer is MapReduce the only data processing engine.
–Docker containers managed by YARN. Yes Please!
• Decouples resource scheduling from application lifecycle.
–Improved scalability and fault tolerance
• Dynamically allocated resources, resulting in HUGE utilization gains
–Versus static allocation of “slots” in Hadoop 1.0
Yahoo runs YARN on over 30,000 nodes holding over 365 PB of data.
By its own calculation, that is about 400,000 jobs per day, for about 10 million hours of compute time.
Yahoo also estimates a 60% to 150% improvement in node usage per day since moving to YARN.
Apache HDFS – Hadoop Distributed File System
• Very large-scale distributed file system
– 10K nodes, tens of millions of files and PBs of data
– Supports large files
• Designed to run on commodity hardware; assumes hardware failures
– Files are replicated to handle hardware failure
– Detects failures and recovers from them automatically
• Optimized for large-scale processing
– Data locations are exposed so that computations can move to where the data resides
• Data coherency
– Write-once, read-many access pattern
• Files are broken up into chunks called ‘blocks’
– Blocks are distributed over nodes
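The block and replication ideas above can be sketched as a small planning function. The slide gives neither a block size nor a replication factor, so the 128 MB and 3 used here are assumptions, as are the node names; the round-robin placement is a deliberate simplification, since real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size in bytes; configurable in HDFS
REPLICATION = 3                  # assumed replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, [nodes holding a replica])."""
    num_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(num_blocks):
        length = min(block_size, file_size - i * block_size)
        # Round-robin placement on distinct nodes (needs replication <= len(nodes)).
        holders = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, length, holders))
    return plan


plan = plan_blocks(300 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
```

Under these assumptions a 300 MB file becomes two full blocks and one 44 MB block, each stored on three different nodes, so losing any single node leaves every block readable and lets computation be scheduled on whichever node already holds the data.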
Streaming Demo - High Level Architecture
[Diagram: truck streaming data from trucks T(1), T(2) through T(N) arrives at inbound messaging (Kafka) on the Truck Events topic. Inside Storm stream processing on YARN, a Kafka Spout feeds the HBase Bolt, HDFS Bolt and Monitoring Bolt. Outputs shown: HBase (Dangerous Events table, Truck Events), distributed storage (HDFS), and ActiveMQ feeding the web app.]
Demo – Streaming Dashboard
Lab #1: bit.ly/1L3RLMo
Lab #2: bit.ly/1FW7ENl (<-lower case L)
Lab #3: bit.ly/1L3S0ah
Shell cheatsheet: bit.ly/1JN8EsO
Slides: bit.ly/1MtVoIL (<-capital I)
Twitter demo: github.com/abajwa-hw/hdp22-twitter-demo
Custom services: github.com/hortonworks-gallery
webinars: hortonworks.com/partners/learn email: abajwa@
IoT demo: youtube.com/watch?v=FHMMcMYhmNI

Internet of Things Crash Course Workshop at Hadoop Summit

  • 1. Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Real-Time Processing in Hadoop Hadoop Summit 2015 Ali Bajwa Partner Solutions Engineer June 2015
  • 2. © Hortonworks Inc. 2012 Professional Services Agenda • Introduction & about Hortonworks HDP • Overview of logistics industry scenario • Overview of streaming architecture on HDP • Streaming Demo #1 • Integrating Predictive Analytics in streaming scenarios • Streaming Demo with Predictive additions • Q & A
  • 3. Preface: Enabling Technologies • Problems solved at scale, via fundamentally new approaches… • Make it possible, even simple, to produce new products/applications that would have been too cost-prohibitive, or simply impossible, beforehand. • Where foundation tech like Li-Ion batteries, retina displays, GPS & tiny HD cameras (from smartphones) have enabled electric cars, quad-copters, VR displays, & more… • Hadoop has similarly led to breakthroughs in big data scale & capability, and enables new real-time advanced analytic applications.
  • 4. Why did Hadoop emerge? April 2015
  • 5. Traditional systems under pressure. Challenges • Constrains data to app • Can’t manage new data • Costly to scale. [Figure: business value against data growth, from 2.8 zettabytes (2012) to 40 zettabytes (2020); laggards vs. industry leaders. Traditional data: ERP, CRM, SCM. New data: clickstream, geolocation, web data, Internet of Things, docs and emails, server logs.]
  • 6. Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP Spring 2015 Hortonworks. We do Hadoop.
  • 7. Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP. Customer Momentum • 330+ customers (as of year-end 2014). Hortonworks Data Platform • Completely open multi-tenant platform for any app & any data. • A centralized architecture of consistent enterprise services for resource management, security, operations, and governance. Partner for Customer Success • Open source community leadership focused on enterprise needs • Unrivaled world-class support • Founded in 2011 • Original 24 architects, developers, operators of Hadoop from Yahoo! • 600+ Employees • 1000+ Ecosystem Partners
  • 8. Customer Partnerships matter. Driving our innovation through Apache Software Foundation projects. Apache project (committers / PMC members): Hadoop 27 / 21; Pig 5 / 5; Hive 18 / 6; Tez 16 / 15; HBase 6 / 4; Phoenix 4 / 4; Accumulo 2 / 2; Storm 3 / 2; Slider 11 / 11; Falcon 5 / 3; Flume 1 / 1; Sqoop 1 / 1; Ambari 34 / 27; Oozie 3 / 2; Zookeeper 2 / 1; Knox 13 / 3; Ranger 10 / n/a; TOTAL 161 / 108. Source: Apache Software Foundation, as of 11/7/2014. Hortonworkers are the architects and engineers that lead development of open source Apache Hadoop at the ASF. • Expertise: uniquely capable to solve the most complex issues & ensure success with latest features • Connection: provide customers & partners direct input into the community roadmap • Partnership: we partner with customers with subscription offering. Our success is predicated on yours. Chart values: 27; Cloudera: 11; Facebook: 5; LinkedIn: 2; IBM: 2; Others: 23; Yahoo: 10
  • 9. Technology Partnerships matter. Hortonworks relationship columns: Named Partner, Certified Solution, Resells, Joint Engr. Check marks per partner across those columns: Microsoft ✓✓✓✓; HP ✓✓✓✓; SAS ✓✓✓; SAP ✓✓✓✓; IBM ✓✓✓; Pivotal ✓✓✓; Redhat ✓✓✓; Teradata ✓✓✓✓; Informatica ✓✓✓; Oracle ✓✓. It is not just about packaging and certifying software… Our joint engineering with our partners drives open source standards for Apache Hadoop. HDP is Apache Hadoop.
  • 10. HDP delivers a Centralized Architecture. Modern Data Architecture • Unifies data and processing. • Enables applications to have access to all your enterprise data through an efficient centralized platform • Supported with a centralized approach to governance, security and operations • Versatile to handle any applications and datasets no matter the size or type. [Figure labels: SOURCES: Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured; Existing Systems: ERP, CRM, SCM; ANALYTICS: Data Marts, Applications, Business Analytics, Visualization & Dashboards; HDFS (Hadoop Distributed File System); YARN: Data Operating System; workloads: Batch, Interactive, Real-Time, Partner ISV; MPP; EDW.]
  • 11. HDP delivers a completely open data platform. Hortonworks Data Platform 2.2. Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core enterprise services, for any application and any data. Completely Open • HDP incorporates every element required of an enterprise data platform: data storage, data access, governance, security, operations • All components are developed in open source and then rigorously tested, certified, and delivered as an integrated open source platform that’s easy to consume and use by the enterprise and ecosystem. [Figure labels: YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System); GOVERNANCE; BATCH, INTERACTIVE & REAL-TIME DATA ACCESS; Apache Pig, Apache Falcon, Apache Hive, Cascading, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, Apache Sqoop, Apache Flume, Apache Kafka; SECURITY: Apache Ranger, Apache Knox, Apache Falcon; OPERATIONS: Apache Ambari, Apache Zookeeper, Apache Oozie.]
  • 12. Real World Use Case: Trucking Company Spring 2015 Hortonworks. We do Hadoop.
  • 13. Scenario Overview
  • 14. Trucking company w/ large fleet of trucks in Midwest. A truck generates millions of events for a given route; an event could be: • ‘Normal’ events: starting / stopping of the vehicle • ‘Violation’ events: speeding, excessive acceleration and braking, unsafe tail distance. Company uses an application that monitors truck locations and violations from the truck/driver in real-time. Route? Truck? Driver? Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers.
  • 15. Trucking Company’s YARN-enabled Architecture. Components: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Real-time Serving (HBase); Alerts & Events (ActiveMQ); Real-Time User Interface; SQL Interactive Query (Hive on Tez). Distributed Storage: HDFS. Many Workloads: YARN. One cluster with consistent security, governance & operations.
  • 16. Trucking Company’s YARN-enabled Architecture (same diagram as slide 15)
  • 17. What is Kafka? APACHE KAFKA • High throughput distributed messaging system • Publish-Subscribe semantics but reimagined at the implementation level to operate at speed with big data volumes • Kafka @LinkedIn: • 800 billion messages per day • 175 terabytes of data written per day • 650 terabytes of data read per day • Over 13 million messages/2.75GB of data per second. [Figure: producers, a Kafka cluster, and consumers.]
  • 18. Kafka: Anatomy of a Topic. [Figure: a topic with three partitions (Partition 0, Partition 1, Partition 2), each an ordered sequence of numbered offsets running from old to new, with writes arriving at the new end.] APACHE KAFKA • Partitioning allows topics to scale beyond a single machine/node • Topics can also be replicated, for high availability.
  • 19. Trucking Company’s YARN-enabled Architecture (same diagram as slide 15)
  • 20. Apache Storm • Distributed, real time, fault tolerant Stream Processing platform. • Provides processing guarantees. • Key concepts include: • Tuples • Streams • Spouts • Bolts • Topology
  • 21. Tuples and Streams • What is a Tuple? – Fundamental data structure in Storm: a named list of values that can be of any data type. • What is a Stream? – An unbounded sequence of tuples. – Streams are the core abstraction in Storm and are what you “process” in Storm.
  • 22. Spouts • What is a Spout? – Generates, or is a source of, streams – E.g.: JMS, Twitter, Log, Kafka Spout – Can spin up multiple instances of a Spout and dynamically adjust as needed
  • 23. Bolts • What is a Bolt? – Processes any number of input streams and produces output streams – Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic – Can spin up multiple instances of a Bolt and dynamically adjust as needed • Bolts used in the Use Case: 1. HBaseBolt: persisting and counting in HBase 2. HDFSBolt: persisting into HDFS as Avro files using Flume 3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
  • 24. Topology • What is a Topology? – A network of spouts and bolts wired together into a workflow. [Figure: Truck-Event-Processor Topology: Kafka Spout, HBase Bolt, Monitoring Bolt, HDFS Bolt, WebSocket Bolt, connected by streams.]
  • 25. Trucking Company’s YARN-enabled Architecture (same diagram as slide 15)
  • 26. Key Constructs in Apache HBase • HBase = Key / Value store • Designed for petabyte scale • Supports low latency reads, writes and updates • Key features – Updateable records – Versioned records – Distributed across a cluster of machines – Low latency – Caching • Popular use cases: – User profiles and session state – Object store – Sensor apps
  • 27. Data Assignment. [Figure: the keys within an HBase table are divided among different RegionServers.]
  • 28. Data Access • Get – Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey • Put – Inserts a new version of a cell. • Scan – Reads the whole table, row by row, or a section of that table starting at a particular start key and ending at a particular end key • Delete – Actually a version of put (adds a new version carrying a deletion marker) • SQL via Apache Phoenix – Unique capability in the NoSQL market
  • 29. Trucking Company’s YARN-enabled Architecture (same diagram as slide 15)
  • 30. [Figure: timeline. 2006: Hadoop w/ MapReduce (MapReduce on HDFS, the Hadoop Distributed File System): largely batch processing, silo’d clusters, difficult to integrate. 2009: MR-279: YARN. October 23, 2013: Hadoop 2 & YARN based architecture (YARN: Data Operating System on HDFS) running Batch, Interactive and Real-Time workloads.] Architected & led development of YARN to enable the Modern Data Architecture.
  • 31. Benefits of YARN as the Data Operating System • The container based model allows for running nearly any workload. – Enables the centralized architecture. – No longer is MapReduce the only data processing engine. – Docker containers managed by YARN. Yes Please! • Decouples resource scheduling from application lifecycle. – Improved scalability and fault tolerance • Dynamically allocated resources, resulting in HUGE utilization gains – Versus static allocation of “slots” in Hadoop 1.0. Yahoo has over 30,000 nodes running YARN across over 365PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time. They also have estimated a 60% to 150% improvement on node usage per day since moving to YARN.
  • 32. Trucking Company’s YARN-enabled Architecture (same diagram as slide 15)
  • 33. Apache HDFS – Hadoop Distributed File System • Very large scale distributed file system • 10K nodes, tens of millions of files and PBs of data • Supports large files • Designed to run on commodity hardware, assumes hardware failures • Files are replicated to handle hardware failure • Detects failures and recovers from them automatically • Optimized for large scale processing • Data locations are exposed so that the computations can move to where data resides • Data coherency • Write once and read many times access pattern • Files are broken up into chunks called ‘blocks’ • Blocks are distributed over nodes
  • 34. Streaming Demo - High Level Architecture. [Figure labels: Truck Streaming Data T(1), T(2), … T(N); Inbound Messaging (Kafka): Truck Events Topic; Storm Stream Processing: Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt; HBase: Dangerous Events Table; Truck Events; ActiveMQ; Web App; YARN; Distributed Storage: HDFS.]
  • 35. Demo – Streaming Dashboard
  • 36. Lab #1: bit.ly/1L3RLMo • Lab #2: bit.ly/1FW7ENl (<-lower case L) • Lab #3: bit.ly/1L3S0ah • Shell cheatsheet: bit.ly/1JN8EsO • Slides: bit.ly/1MtVoIL (<-capital I) • Twitter demo: github.com/abajwa-hw/hdp22-twitter-demo • Custom services: github.com/hortonworks-gallery • Webinars: hortonworks.com/partners/learn • Email: abajwa@ • IoT demo: youtube.com/watch?v=FHMMcMYhmNI
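
The Storm concepts from slides 20 to 24 (tuple, stream, spout, bolt, topology) can be sketched in plain Python. This is only a toy model under stated assumptions, not Storm's API: the names `truck_event_spout`, `CountingBolt`, `MonitoringBolt` and `run_topology` are hypothetical, the bolts are wired as a simple linear chain (the slides do not show the exact wiring), and an in-memory dict stands in for HBase.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

# A Storm tuple is a named list of values; a dict stands in for it here.
Event = Dict[str, str]


def truck_event_spout(source: Iterable[Event]) -> Iterator[Event]:
    """Spout: emits the stream of tuples (hypothetical stand-in for the Kafka spout)."""
    for event in source:
        yield event


class CountingBolt:
    """Bolt: counts violation events per driver (in-memory stand-in for the HBase bolt)."""

    def __init__(self) -> None:
        self.violations: Dict[str, int] = defaultdict(int)

    def process(self, event: Event) -> Event:
        if event["type"] != "normal":
            self.violations[event["driver"]] += 1
        return event


class MonitoringBolt:
    """Bolt: records an alert when a driver's violation count exceeds the threshold."""

    def __init__(self, counts: Dict[str, int], threshold: int) -> None:
        self.counts = counts
        self.threshold = threshold
        self.alerts: List[str] = []

    def process(self, event: Event) -> Event:
        driver = event["driver"]
        if event["type"] != "normal" and self.counts[driver] > self.threshold:
            self.alerts.append(f"driver {driver}: {self.counts[driver]} violations")
        return event


def run_topology(spout: Iterator[Event], bolts: list) -> None:
    """Topology: pass every tuple from the spout through each bolt in order (assumed wiring)."""
    for event in spout:
        for bolt in bolts:
            event = bolt.process(event)
```

A run would build a `CountingBolt`, hand its `violations` dict to a `MonitoringBolt`, and call `run_topology(truck_event_spout(events), [counter, monitor])`. In real Storm the spout and bolts run as parallel tasks across a cluster; the single loop here only illustrates the data flow.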

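The HBase access operations on slide 28 (Get, Put, Scan, and Delete as a put with a deletion marker) can be illustrated with a small in-memory model. This is a sketch under assumptions, not the HBase client API: `ToyTable` and its methods are hypothetical names, versions are stamped by a simple counter rather than timestamps, and the scan uses a half-open key range (start inclusive, stop exclusive) as a choice made for this sketch.

```python
import itertools
from typing import Dict, Iterator, List, Optional, Tuple

TOMBSTONE = object()  # deletion marker written by delete()


class ToyTable:
    """Hypothetical in-memory model of versioned cells keyed by (rowkey, column)."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[str, str], List[Tuple[int, object]]] = {}
        self._clock = itertools.count(1)  # stand-in for cell timestamps

    def put(self, row: str, column: str, value: object) -> None:
        """Put: append a new version of the cell; older versions are kept."""
        self._cells.setdefault((row, column), []).append((next(self._clock), value))

    def delete(self, row: str, column: str) -> None:
        """Delete: a put of a new version that carries the deletion marker."""
        self.put(row, column, TOMBSTONE)

    def versions(self, row: str, column: str) -> List[Tuple[int, object]]:
        """Return every stored version of a cell, oldest first."""
        return list(self._cells.get((row, column), []))

    def get(self, row: str, column: str) -> Optional[object]:
        """Get: return the newest version, or None if absent or deleted."""
        stored = self._cells.get((row, column))
        if not stored:
            return None
        value = stored[-1][1]
        return None if value is TOMBSTONE else value

    def scan(self, start_row: str, stop_row: str) -> Iterator[Tuple[str, str, object]]:
        """Scan: yield live cells with start_row <= rowkey < stop_row, in key order."""
        for row, column in sorted(self._cells):
            if start_row <= row < stop_row:
                value = self.get(row, column)
                if value is not None:
                    yield row, column, value
```

Keeping cells sorted by rowkey is what makes range scans cheap and lets a table be split into contiguous key ranges, which is the idea behind dividing keys among RegionServers on slide 27.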
Editor's Notes

  • #4: Why do we now have electric cars at production scale, quad copters and drones cheap enough for the home hobbyist, and VR displays being bought by companies like Facebook?
  • #5: Because the technology is there now, thanks to advances made in other industries, solving problems at scale in a big marketplace.
  • #6: At scale (in this case): $270 bn smartphone market in 2014; $120 bn internet advertising (projected 2015).
  • #8: Before we dive into Hadoop and its role within the modern data architecture, let’s set the context for why Hadoop has become important. Existing approaches for data management have become both technically and commercially impractical. Technically: these systems were never designed to store or process vast quantities of data. Commercially: the licensing structures of the traditional approach are no longer feasible. These two challenges, combined with the rate at which data is being produced, created the need for a new approach to data systems. If we fast-forward another 3 to 5 years, more than half of the data under management within the enterprise will be from these new data sources.
  • #9: Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
  • #10: Single focus: enabling Apache Hadoop as an enterprise data platform for any app and any data type. In the open, partner for success.
  • #11: Everything in the open.
  • #12: Deep joint engineering with Microsoft (HDInsight), HP, SAP and Teradata.
  • #13: In 2011, Hortonworks was founded with the 24 original Hadoop architects and engineers from Yahoo! This original team had been working on a technology called YARN (Yet Another Resource Negotiator) that enables multiple applications to have access to all your enterprise data through an efficient centralized platform. It is the data operating system for Hadoop that provides the versatility to handle any application and dataset no matter the size or type. Moreover, YARN provided the centralized architecture around which the critical enterprise services of Security, Operations, and Governance could be centrally addressed and integrated with existing enterprise policies. This work allowed a new approach to data to emerge: the modern data architecture. At the heart of this approach is the capability for Hadoop to unify data and processing in an efficient data platform.
  • #14: Our product, the Hortonworks Data Platform (or HDP for short), is a completely open source, enterprise-grade data platform comprising dozens of Apache open source projects, with Apache Hadoop and YARN at its center. We have a comprehensive engineering, testing, and certification process that integrates and packages all of these components into a cohesive platform that the enterprise can consume and deploy at scale. And our model enables us to proactively bring new innovations and new open source projects into HDP as they emerge. To ensure the highest quality, we have a test suite, unique to Hortonworks, of tens of thousands of system and integration tests that we run at scale on a regular basis, including on the world’s largest Hadoop clusters at Yahoo! as part of our co-development relationship. While our pure-play competitors focus on proprietary components for security, operations, and governance, we invest in new open source projects that address these areas. For example, earlier in 2014, we acquired a small company called XA Secure that provided a comprehensive security and administration product. We moved the technology wholesale into open source as Apache Ranger. Since our security, operations and governance technologies are open source projects, our partners are able to work with us on those projects to ensure deep integration within our joint solution architectures.
  • #15: Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
  • #18: Elastic Search Flume Sink does exist
  • #19: Elastic Search Flume Sink does exist
  • #20: The key abstraction in Kafka is the topic. Producers publish their records to a topic, and consumers subscribe to one or more topics. A Kafka topic is just a sharded write-ahead log. Messages are not deleted when they are read but retained with some configurable SLA (say a few days or a week). This allows usage in situations where the consumer of data may need to reload data. It also makes it possible to support space-efficient publish-subscribe, as there is a single shared log no matter how many consumers; in traditional messaging systems there is usually a queue per consumer, so adding a consumer doubles your data size. This makes Kafka a good fit for things outside the bounds of normal messaging systems, such as acting as a pipeline for offline data systems such as Hadoop. These offline systems may load only at intervals as part of a periodic ETL cycle, or may go down for several hours for maintenance, during which time Kafka is able to buffer even TBs of unconsumed data if needed. Replication for HA/fault tolerance is built in. It is a pull-based system for consumers instead of push-based. Crude benchmark: single-threaded synchronous messages run at 400k per second when using 6 "datanode-ish" servers. This goes up to 2+ MM when using partitions and asynchronous messages. Server specs in the benchmark: Intel Xeon 2.5 GHz processor with six cores; six 7200 RPM SATA drives; 32GB of RAM; 1Gb Ethernet.
  • #21: A traditional queue retains messages in order on the server, and if multiple consumers consume from the queue then the server hands out messages in the order they are stored. However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing. Kafka does it better. By having a notion of parallelism (the partition) within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. http://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
  • #22: Elastic Search Flume Sink does exist
  • #28: Elastic Search Flume Sink does exist
  • #32: Elastic Search Flume Sink does exist
  • #33: This all changed with the introduction of Hadoop 2 and YARN in October 2013. YARN was first proposed in MR-279 by Arun Murthy in 2009; Arun and the team at Hortonworks architected and led its development as the core change in Hadoop 2. Our view was that to truly enable Hadoop as a component of a broad data architecture, YARN was the fundamental requirement, as it turns Hadoop from a single-application data system into a multi-application data system. This is foundational to our approach of innovating from the core outwards to build Enterprise Hadoop. With YARN it is now possible to land all data in one cluster and then access it in multiple ways: from batch to interactive to real-time. Today YARN, at the core of Hadoop, is the center of our focus on innovation in and around Hadoop. It is clearly the enabling technology that has started a transition to a data lake within organizations. Simply stated: Hortonworks architected & led development of YARN in order to enable the Modern Data Architecture.
  • #35: Elastic Search Flume Sink does exist
  • #38: Data is ingested, it’s on the dashboard, and it’s in HDFS.
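
Notes #20 and #21 describe a topic as a partitioned log whose messages are retained after being read, with each partition consumed by exactly one consumer in a group. A minimal Python sketch of those two ideas follows. It is a toy model, not the Kafka client API: `ToyTopic` and `assign_partitions` are hypothetical names, a CRC32 of the key stands in for Kafka's own partitioner, and round-robin assignment is just one possible assignment strategy.

```python
import zlib
from typing import Dict, List, Tuple


class ToyTopic:
    """Hypothetical in-memory topic: one append-only log per partition."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def publish(self, key: str, message: str) -> Tuple[int, int]:
        """Append to the partition chosen by key; return (partition, offset)."""
        # Assumption: CRC32 stands in for the real partitioner's hash function.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(message)
        return partition, len(log) - 1

    def read(self, partition: int, offset: int) -> List[str]:
        """Return messages from offset onward; reading never deletes anything."""
        return self.partitions[partition][offset:]


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        assignment[consumers[partition % len(consumers)]].append(partition)
    return assignment
```

Because messages with the same key land in the same partition and each partition has a single reader within the group, per-key ordering is preserved while different partitions are processed in parallel. Because `read` takes an offset instead of removing messages, a consumer can rewind and reload data, which is the behavior note #20 relies on.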