1 
Big Data Hoopla Simplified – 
Hadoop, MapReduce, NoSQL … 
TDWI Conference – Memphis, TN 
Oct 29, 2014 
© Talend 2014
2 
About the Presenter 
Rajan Kanitkar 
• Senior Solutions Engineer 
• Rajan Kanitkar is a Pre-Sales Consultant with Talend. He 
has been active in the broader Data Integration space for 
the past 15 years and has experience with several leading 
software companies in these areas. His specialties 
at Talend include Data Integration (DI), Big 
Data (BD), Data Quality (DQ) and Master Data 
Management (MDM). 
• Contact: rkanitkar@talend.com 
3 
Big Data Ecosystem 
4 
Quick Reference – Big Data 
Hadoop: Apache Hadoop is an open-source software framework for storage and large 
scale processing of data-sets on clusters of commodity hardware. 
Hadoop v1.0 – Original version, focused on HDFS and MapReduce. The JobTracker 
handled both cluster resource management and job scheduling/monitoring as one entity. 
Hadoop v2.0 – Sometimes called MapReduce 2 (MRv2), also called YARN. Splits resource 
management and job scheduling/monitoring into separate daemons. This new 
architecture allows processing engines other than MapReduce to be 
managed and monitored on the same cluster. 
5 
Quick Reference - Big Data 
• Hadoop: the core project 
• HDFS: the Hadoop Distributed File System 
• MapReduce: the software framework for distributed 
processing of large data sets 
• Hive: a data warehouse infrastructure that provides data 
summarization and a querying language 
• Pig: a high-level data-flow language and execution 
framework for parallel computation 
• HBase: this is the Hadoop database. Use it when you 
need random, realtime read/write access to your Big 
Data 
• And many many more: Sqoop, HCatalog, Zookeeper, 
Oozie, Cassandra, MongoDB, etc. 
6 
Hadoop Core – HDFS 
[Diagram] A single Name Node serves metadata operations for Clients; Data Nodes store the 
actual blocks. Clients read/write blocks directly from/to the Data Nodes, the Name Node 
issues control commands, and blocks are replicated across Data Nodes. 
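The name-node/data-node split above can be sketched in miniature: the name node keeps only metadata (which blocks make up a file, and which data nodes hold each replica), while data nodes hold the bytes. A toy Python sketch of that bookkeeping, with made-up node names and a tiny block size; not the HDFS API:

```python
# Toy model of HDFS bookkeeping: the name node tracks metadata only;
# data nodes would store the actual block contents.
BLOCK_SIZE = 4          # real HDFS uses 64/128 MB blocks
REPLICATION = 3         # default HDFS replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Name-node metadata: block index -> data nodes holding a replica."""
    placement = {}
    for b in range(num_blocks):
        # simple round-robin placement; real HDFS is rack-aware
        placement[b] = [data_nodes[(b + r) % len(data_nodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks("abcdefghij")    # three blocks: abcd, efgh, ij
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Losing one data node loses only one replica of each block it held; the name node can re-replicate from the survivors.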
7 
Hadoop Core – MapReduce 
The "Word Count Example" 
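The classic word-count flow can be simulated in a few lines of plain Python: the map phase emits (word, 1) pairs, the shuffle groups identical keys, and the reduce phase sums each group. This only illustrates the data flow; real jobs run distributed across the cluster:

```python
# Word count expressed as map -> shuffle -> reduce, in plain Python.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit (word, 1) for every word in the input
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # shuffle: sort/group identical keys together; reduce: sum the counts
    pairs.sort(key=itemgetter(0))
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["the quick fox", "the lazy dog"]))
# counts["the"] == 2
```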
8 
Quick Reference – Data Services 
HCatalog: a set of interfaces that open up access to Hive's metastore for tools inside and outside of the 
Hadoop grid. Hortonworks donated it to Apache; in March 2013 it was merged into the Hive project. Enables users with 
different processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the 
cluster. 
HBase: a non-relational, distributed database modeled after Google’s Bigtable. Good at storing sparse 
data. Considered a key-value columnar database. Runs on top of HDFS. Useful for random real-time 
read/write access. 
Hive: a data warehouse infrastructure built on top of Hadoop. Provides data summarization, ad-hoc 
query, and analysis of large datasets. Lets users query data using a SQL-like language called HiveQL 
(HQL). 
Mahout: a library of scalable machine-learning algorithms, implemented on top of Hadoop. Mahout 
supports collaborative filtering, clustering, classification and item set mining. 
Pig: allows you to write complex MapReduce transformations using a Pig Latin scripting language. Pig 
Latin defines a set of transformations such as aggregate, join and sort. Pig translates the Pig Latin script 
into MapReduce so that it can be executed within Hadoop. 
SQOOP: utility for bulk data import/export between HDFS and structured data stores such as relational 
databases. 
9 
Quick Reference – Operational Services 
Oozie: Apache workflow scheduler for Hadoop. It allows for coordination between Hadoop jobs. A workflow in 
Oozie is defined in what is called a Directed Acyclic Graph (DAG). 
Zookeeper: a distributed, highly available coordination service. Allows distributed processes to coordinate with 
each other through a shared hierarchical name space of data registers (called znodes). Writing distributed 
applications is hard. It’s hard primarily because of partial failure. ZooKeeper gives you a set of tools to build 
distributed applications that can safely handle partial failures. 
Kerberos: a computer network authentication protocol which provides mutual authentication. The name comes 
from the three-headed dog of Greek mythology. The three heads of Kerberos are 1) the Key Distribution Center (KDC), 2) the client user, and 3) 
the server with the desired service to access. The KDC performs two service functions: Authentication (are you who 
you say you are) and Ticket-Granting (gives you an expiring ticket that grants access to certain resources). A 
Kerberos principal is a unique identity to which Kerberos can assign tickets (like a username). A keytab is a file 
containing pairs of Kerberos principals and encrypted keys (derived from the Kerberos password). 
10 
MapReduce 2.0, YARN, Storm, Spark 
• YARN: Ensures predictable performance & QoS for all apps 
• Enables apps to run “IN” Hadoop rather than “ON” 
• Streaming with Apache Storm 
• Mini-Batch and In-Memory with Apache Spark 
Applications Run Natively IN Hadoop 
YARN (Cluster Resource Management) 
HDFS2 (Redundant, Reliable Storage) 
BATCH 
(MapReduce) 
INTERACTIVE 
(Tez) 
STREAMING 
(Storm, Spark) 
GRAPH 
(Giraph) 
NoSQL 
(MongoDB) 
EVENTS 
(Falcon) 
ONLINE 
(HBase) 
OTHER 
(Search) 
Source: Hortonworks
11 
Quick Reference – Hadoop 2.0 Additions 
Storm: distributed realtime computation system. A Storm cluster is similar to a Hadoop cluster. On 
Hadoop you run "MapReduce jobs". On Storm you run "topologies". Jobs and topologies are very 
different -- in that a MapReduce job eventually finishes, but a topology processes messages forever 
(or until you kill it). Storm can run on top of YARN. 
Spark: parallel computing engine that can operate over any Hadoop input source: HDFS, 
HBase, Amazon S3, Avro, etc. Holds intermediate results in memory, rather than writing them to 
disk; this drastically reduces query return time. Similar to a Hadoop cluster, but supports more than just 
MapReduce. 
Tez: framework which allows a complex directed-acyclic-graph of tasks for processing data, 
built atop Apache Hadoop YARN. MapReduce is batch-oriented and unsuited for interactive query; 
Tez allows Hive and Pig to process interactive queries at petabyte scale. It also 
supports machine learning. 
12 
Apache Spark 
What is Spark? 
• Spark is an in-memory cluster computing engine that includes an HDFS-compatible 
in-memory file system. 
Hadoop MapReduce 
• Batch processing at scale 
• Storage: Hadoop HDFS 
• Runs on Hadoop 
VS 
Spark 
• Batch, interactive, graph and real-time processing 
• Storage: – Hadoop HDFS, Amazon S3, Cassandra… 
• Runs on many platforms 
• Fast in-memory processing, up to 100x faster than MapReduce (M/R)
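Spark's core idea — chaining transformations over a dataset held in memory instead of writing intermediate results to disk — can be sketched with a toy class. This mimics the RDD style only; it is not the PySpark API:

```python
# Toy illustration of the RDD idea: chained transformations over
# an in-memory dataset. Not the real PySpark API.
from functools import reduce as _reduce

class ToyRDD:
    def __init__(self, data):
        self.data = list(data)      # held in memory, never spilled to disk

    def map(self, fn):
        return ToyRDD(fn(x) for x in self.data)

    def filter(self, fn):
        return ToyRDD(x for x in self.data if fn(x))

    def reduce(self, fn):
        return _reduce(fn, self.data)

total = (ToyRDD(range(10))
         .map(lambda x: x * x)          # square each element
         .filter(lambda x: x % 2 == 0)  # keep even squares: 0, 4, 16, 36, 64
         .reduce(lambda a, b: a + b))   # sum them -> 120
```

In real Spark each transformation is lazy and the data is partitioned across the cluster; the in-memory chaining is what avoids the per-stage disk writes of MapReduce.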
13 
Apache Storm 
What Is Storm? 
• Storm is a cluster engine that executes applications performing real-time 
analysis of streaming data in motion – enabling the Internet of Things for 
data such as sensor data, aircraft parts data, traffic analysis, etc. 
Storm 
• Real-time stream processing at scale 
• Storage: None - Data in Motion 
• Runs on Hadoop or on its own cluster 
• Fast in-memory processing 
VS 
Spark 
• Batch, interactive, graph and real-time processing 
• Storage: – Hadoop HDFS, Amazon S3, Cassandra… 
• Runs on many platforms 
• Fast in-memory processing
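The "data in motion" model above can be sketched with Python generators: a spout emits a stream of tuples, and bolts transform them one at a time as they arrive. The spout/bolt names follow Storm's vocabulary, but this is a conceptual sketch, not the Storm API:

```python
# Toy Storm-style topology: a spout emits a stream of tuples and a bolt
# processes them one at a time. Generators stand in for spouts/bolts.
def sensor_spout(readings):
    for r in readings:              # in Storm this stream never ends
        yield r

def threshold_bolt(stream, limit):
    # emit an alert tuple whenever a reading exceeds the limit
    for reading in stream:
        if reading > limit:
            yield ("ALERT", reading)

alerts = list(threshold_bolt(sensor_spout([12, 99, 7, 101]), limit=50))
# -> [("ALERT", 99), ("ALERT", 101)]
```

Note that no reading is ever "stored": each tuple flows through the topology and is processed in place, which is the storage-free, data-in-motion contrast with batch MapReduce.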
14 
Quick Reference – Big Data 
Vendors: The Apache Hadoop eco-system is a collection of many projects. 
Because of the complexities, “for profit” companies have packaged, added to, 
enhanced and tried to differentiate themselves in the Hadoop world. The main 
players are: 
- Cloudera – CDH – Cloudera Distribution for Hadoop. Current version is CDH 
5.2 (includes YARN) 
- Hortonworks - HDP – Hortonworks Data Platform. Spun out of Yahoo in 
2011. Current version is HDP 2.2 (YARN) 
- MapR – M3 (Community), M5 (Enterprise), M7 (adds NoSQL). Apache 
Hadoop derivative. Uses its own MapR-FS (accessible via NFS) in place of HDFS. 
- Pivotal - GPHD – Greenplum Hadoop. Spun out of EMC in 2013. Current is 
Pivotal HD 2.0 (YARN) 
15 
Quick Reference – NoSQL 
NoSQL: A NoSQL database provides a mechanism for storage and 
retrieval of data that is modeled in means other than the tabular relations 
used in relational databases – document, graph, columnar databases. 
Excellent comparison of NoSQL databases by Kristof Kovacs: 
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis 
Includes a comparison of: 
- Cassandra 
- MongoDB 
- Riak 
- Couchbase 
- … and many more 
16 
Quick Reference – NoSQL 
Document Storage: stores documents that encapsulate and encode data in some 
standard format (including XML, YAML, and JSON, as well as binary forms like BSON, 
PDF and Microsoft Office documents). Different implementations offer different ways of 
organizing and/or grouping documents. 
Documents are addressed in the database via a unique key that represents that 
document. A key feature is that the database offers an API or query language for 
retrieving documents based on their contents. 
CouchDB: Apache database that focuses on embracing the web. Uses JSON to store 
data, Javascript as its query language using MapReduce, and HTTP for an API. The 
HTTP API is a differentiator between CouchDB and Couchbase. 
Couchbase: designed to provide key-value or document access. Native JSON support. 
Membase + CouchDB = Couchbase. Couchbase architecture includes auto-sharding, 
memcache and 100% uptime redundancy over CouchDB alone. Couchbase has a free 
version but is not open-source. 
MongoDB: JSON/BSON style documents with flexible schemas to store data. A 
“collection” is a grouping of MongoDB documents. Collections do not enforce document 
structures. 
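The two access patterns above — address a document by its unique key, and query by contents — can be sketched with a dictionary keyed by document ID. A toy illustration of the model only, not the MongoDB or CouchDB API:

```python
# Minimal document-store behavior: documents addressed by a unique key,
# plus query-by-contents. Documents need not share a schema.
store = {}

def insert(doc_id, doc):
    store[doc_id] = doc

def find(criteria):
    """Return documents whose fields match every key/value in criteria."""
    return [d for d in store.values()
            if all(d.get(k) == v for k, v in criteria.items())]

insert("u1", {"name": "Ada", "city": "Memphis"})
insert("u2", {"name": "Sam", "city": "Austin", "tags": ["bigdata"]})

memphis_users = find({"city": "Memphis"})   # query by contents
by_key = store["u2"]                        # direct lookup by unique key
```

Note the second document carries a field the first lacks; that schema flexibility is exactly what "collections do not enforce document structures" means.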
17 
Quick Reference – NoSQL 
Column Storage: stores data tables as sections of columns of data rather than as 
rows of data. Good for finding or aggregating over large sets of similar data. Column 
storage serializes all data for one column contiguously on disk (so reading a 
column is very quick). Organization of your data REALLY matters in columnar storage. No 
restriction on the number of columns. One row in a relational store may be many rows in 
columnar. 
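The row-versus-column layout can be shown side by side: pivoting a list of rows into one contiguous list per column means an aggregation over a single column touches only that column's values. A small sketch with made-up data:

```python
# Row storage vs column storage: columnar layout keeps each column's
# values contiguous, so scanning one column touches far less data.
rows = [
    {"id": 1, "city": "Memphis", "sales": 10},
    {"id": 2, "city": "Austin",  "sales": 20},
    {"id": 3, "city": "Memphis", "sales": 5},
]

# Columnar layout: one contiguous list per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Aggregating a single column reads only that column's values,
# instead of deserializing every full row.
total_sales = sum(columns["sales"])
```

On disk the difference is bandwidth: a row store must read every field of every row to sum one column, while the column store reads one contiguous run.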
Cassandra: Apache distributed database designed to handle large amounts of data 
across many commodity servers, providing high availability with no single point of 
failure. 
DynamoDB: Amazon’s NoSQL database service. All data stored on solid state drives and replicated 
across three availability zones. Integrated with Amazon EMR and S3. Stores “Items” (collections of 
key-value pairs) identified by an ID. 
Riak: a distributed fault-tolerant key-value database. HTTP/REST API. Can walk links (similar to a 
graph). Best used for single-site scalability, availability and fault tolerance – places where even 
seconds of downtime hurt data collection. Good for point-of-sale or factory control system data 
collection. 
HBase: non-relational data store on top of Hadoop. Think of the column as the key and the data as the value. 
Column families must be created when the table is created; in Talend, look on the Advanced tab to create 
families, then use them when writing data.
18 
Big Data Integration Landscape 
19 
Data-Driven Landscape 
• Hadoop & NoSQL 
• Data Quality 
• Latency & Velocity 
• Expanding Data Volumes 
• Master Data Consistency 
• Lack of Talent / Skills 
• Siloed Data due to SaaS 
• No End-to-End metadata visibility
20 
Macro Trends Revolutionizing 
the Integration Market 
• The amount of data will grow 50X from 2010 to 2020 
• 64% of enterprises surveyed indicate that they’re deploying or planning Big Data projects 
• By 2020, 55% of CIOs will source all their critical apps in the Cloud 
Source: Gartner and Cisco reports
21 
The New Data Integration Economics 
“Big data is what 
happened when the cost 
of keeping information 
became less than the 
cost of throwing it away.” 
– Technology Historian George Dyson 
• 45x savings: $1,000/TB for Hadoop vs $45,000/TB for traditional infrastructure 
• $600B revenue shift by 2020 to companies that use big data effectively 
• 6x faster ROI using big data analytics tools vs a traditional EDW 
• 600x active data: Neustar moved from storing 1% of data for 60 days to 100% for one year
22 
Existing Infrastructures Under Distress: 
Architecturally and Economically 
[Diagram] Sources – Weblogs, Relational Systems/ERP, Legacy Systems, External Data 
Sources – feed a Transform step into the Data Marts (the data warehouse, with shared 
Metadata), which serve Standard Reports, Ad-hoc Query Tools, Data Mining, MDD/OLAP 
and Analytical Applications. Pressures: batch to real-time, data explosion, need more 
active data. 
23 
Benefits of Hadoop and NoSQL 
[Diagram] Sources – Web Logs, IoT, ERP, Legacy Systems – feed NoSQL stores, the 
DBMS/EDW and Data Marts (the data warehouse), which serve Standard Reports, Ad-hoc 
Query Tools, Data Mining, MDD/OLAP and Analytical Applications. Hadoop and NoSQL 
absorb the data explosion, enable batch-to-real-time processing, and keep data 
active longer. 
24 
Top Big Data Challenges 
Source: Gartner - Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind 
the Hype - 12 September 2013 - G00255160 
“How To” 
Challenges
25 
Big Data Integration Capabilities 
26 
Top Big Data Challenges 
Need Solutions that 
Address these 
Challenges 
Source: Gartner - Survey Analysis: Big Data Adoption in 2013 Shows Substance 
Behind the Hype - 12 September 2013 - G00255160
27 
Convergence, Big Data & Consumerization 
• Next-gen integration platforms need to be designed 
& architected with big data requirements in mind 
• ETL / ELT parallelization: processing needs to be distributed & flexible 
• Big data technologies need to be integrated seamlessly with existing integration 
investments (RDBMS and beyond)
28 
Big Data Integration Landscape 
29 
“I may say that this is the greatest 
factor: the way in which the 
expedition is equipped.” 
Roald Amundsen 
race to the South Pole, 1911 
Source of Roald Amundsen portrait: 
Norwegian National Library 
30 
Big Data Integration: Ingest – Transform – Deliver 
[Diagram] INGEST (SQOOP, FLUME, HDFS API, HBase API) feeds TRANSFORM (Data 
Refinement: map, profile, parse, standardize, cleanse, CDC, machine-learning match), 
which runs alongside Hive and the YARN engines – BATCH (MapReduce), INTERACTIVE (Tez), 
STREAMING (Storm, Spark), GRAPH (Giraph), NoSQL (MongoDB), Events (Falcon), ONLINE 
(HBase), OTHER (Search) – on YARN (Cluster Resource Management) and HDFS2 (Redundant, 
Reliable Storage). DELIVER (as an API) uses Karaf, ActiveMQ, CXF, Camel, Storm and Kafka. 
Surrounding platform services: iPaaS, MDM, HA, governance, security, metadata.
31 
Big Data Integration and Processing 
[Diagram] Data from Various Source Systems → Load to HDFS (Big Data Integration) → 
HADOOP (HDFS, Map/Reduce, Hive) → Federate to analytics → Analytics Dashboard
32 
Important Objectives 
• Moving from hand-code to code generation – MapReduce, 
Pig, Hive, SQOOP etc. – using a graphical user interface 
• Zero footprint on the Hadoop cluster 
• Same graphical user interface for both standard data 
integration and Big Data integration 
33 
Trying to get from this… 
34 
“pure Hadoop” and MapReduce 
Visually design MapReduce jobs and optimize them before 
deploying on Hadoop 
to this…
35 
Native Map/Reduce Jobs 
• Create graphical ETL patterns using native Map/Reduce 
• Reduce the need for big 
data coding skills 
• Zero pre-installation on 
the Hadoop cluster 
• Hadoop is the “engine” 
for data processing
36 
Other Important Objectives 
Enables organizations to leverage existing skills such as 
Java and other open source languages 
- A large collaborative community for support 
- A large number of components for data and applications, including big data 
and NoSQL 
Works directly on the Apache Hadoop API 
- Native support for YARN and Hadoop 2.0 for better resource 
optimization 
Software created through open standards and development 
processes that eliminate vendor lock-in 
- Scalability, portability and performance come for “free” due to Hadoop 
37 
Talend Solution for Big Data Integration
38 
Talend’s Solution 
39 
The Value of Talend for Big Data 
Leverage In-house Resources 
- Easy-to-use familiar Eclipse-tools that generate big data code 
- 100% standards-based, open source 
- Lots of examples with a large collaborative community 
Big Data Ready 
- Native support for Hadoop, MapReduce, and NoSQL 
- 800+ connectors to all data sources 
- Built-in data quality, security and governance (Platform for Big Data) 
Lower Costs 
- A predictable and scalable subscription model 
- Based only on users (not CPUs or connectors) 
- Free to download, no runtimes to install on your cluster 
40 
Talend’s Value for Big Data 
• New frameworks like Spark and Storm are emerging on 
Hadoop and can run on other platforms 
• Companies want to accelerate big data processing and do 
more sophisticated workloads by exploiting in-memory 
capabilities via Spark and for analyzing real-time data in 
motion via Storm 
• Talend can generate Storm applications to analyze and 
filter data in real-time as well as use source data filtered 
by Storm applications 
• Talend can help customers rapidly exploit new Big Data 
technologies to reduce time to value while insulating them 
from future extensions and advancements 
41 
Thank You For Your Participation 

More Related Content

PPTX
Etl with talend (big data)
pomishra
 
PPTX
SQL on Hadoop for the Oracle Professional
Michael Rainey
 
PPTX
Hadoop crash course workshop at Hadoop Summit
DataWorks Summit
 
PDF
Talend Open Studio Data Integration
Roberto Marchetto
 
PPTX
Big Data Simplified - Is all about Ab'strakSHeN
DataWorks Summit
 
PPSX
Intro to Talend Open Studio for Data Integration
Philip Yurchuk
 
PDF
Talend Summer '17 Release: New Features and Tech Overview
Talend
 
PDF
Filling the Data Lake
DataWorks Summit/Hadoop Summit
 
Etl with talend (big data)
pomishra
 
SQL on Hadoop for the Oracle Professional
Michael Rainey
 
Hadoop crash course workshop at Hadoop Summit
DataWorks Summit
 
Talend Open Studio Data Integration
Roberto Marchetto
 
Big Data Simplified - Is all about Ab'strakSHeN
DataWorks Summit
 
Intro to Talend Open Studio for Data Integration
Philip Yurchuk
 
Talend Summer '17 Release: New Features and Tech Overview
Talend
 
Filling the Data Lake
DataWorks Summit/Hadoop Summit
 

What's hot (19)

PDF
ETL using Big Data Talend
Edureka!
 
PPTX
Apache Hive 2.0: SQL, Speed, Scale
DataWorks Summit/Hadoop Summit
 
PPTX
The DAP - Where YARN, HBase, Kafka and Spark go to Production
DataWorks Summit/Hadoop Summit
 
PPTX
Real time fraud detection at 1+M scale on hadoop stack
DataWorks Summit/Hadoop Summit
 
PPTX
ETL big data with apache hadoop
Maulik Thaker
 
PDF
Manipulating Data with Talend.
Edureka!
 
PDF
A Reference Architecture for ETL 2.0
DataWorks Summit
 
PDF
Splice machine-bloor-webinar-data-lakes
Edgar Alejandro Villegas
 
PPTX
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 
PPTX
What's new in Ambari
DataWorks Summit
 
PPTX
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
PDF
Ingesting Data at Blazing Speed Using Apache Orc
DataWorks Summit
 
PPTX
YARN Ready: Apache Spark
Hortonworks
 
PPTX
HDFS: Optimization, Stabilization and Supportability
DataWorks Summit/Hadoop Summit
 
PPTX
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
DataWorks Summit
 
PPTX
Integrating Apache Spark and NiFi for Data Lakes
DataWorks Summit/Hadoop Summit
 
PPTX
Protecting your Critical Hadoop Clusters Against Disasters
DataWorks Summit
 
PPTX
The Future of Apache Hadoop an Enterprise Architecture View
DataWorks Summit/Hadoop Summit
 
PPTX
Preventative Maintenance of Robots in Automotive Industry
DataWorks Summit/Hadoop Summit
 
ETL using Big Data Talend
Edureka!
 
Apache Hive 2.0: SQL, Speed, Scale
DataWorks Summit/Hadoop Summit
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
DataWorks Summit/Hadoop Summit
 
Real time fraud detection at 1+M scale on hadoop stack
DataWorks Summit/Hadoop Summit
 
ETL big data with apache hadoop
Maulik Thaker
 
Manipulating Data with Talend.
Edureka!
 
A Reference Architecture for ETL 2.0
DataWorks Summit
 
Splice machine-bloor-webinar-data-lakes
Edgar Alejandro Villegas
 
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 
What's new in Ambari
DataWorks Summit
 
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Ingesting Data at Blazing Speed Using Apache Orc
DataWorks Summit
 
YARN Ready: Apache Spark
Hortonworks
 
HDFS: Optimization, Stabilization and Supportability
DataWorks Summit/Hadoop Summit
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
DataWorks Summit
 
Integrating Apache Spark and NiFi for Data Lakes
DataWorks Summit/Hadoop Summit
 
Protecting your Critical Hadoop Clusters Against Disasters
DataWorks Summit
 
The Future of Apache Hadoop an Enterprise Architecture View
DataWorks Summit/Hadoop Summit
 
Preventative Maintenance of Robots in Automotive Industry
DataWorks Summit/Hadoop Summit
 
Ad

Viewers also liked (12)

PPTX
Essential Tools For Your Big Data Arsenal
MongoDB
 
PDF
TOUG Big Data Challenge and Impact
Toronto-Oracle-Users-Group
 
PPS
Big Data Science: Intro and Benefits
Chandan Rajah
 
PPTX
Simplifying Big Data ETL with Talend
Edureka!
 
PDF
Big data: Bringing competition policy to the digital era – Background note – ...
OECD Directorate for Financial and Enterprise Affairs
 
PPTX
Talend Big Data Capabilities Overview
Rajan Kanitkar
 
PDF
QlikView & Big Data
Mischa van Werkhoven
 
PDF
Open Source ETL using Talend Open Studio
santosluis87
 
PDF
Big Data Industry Insights 2015
Den Reymer
 
KEY
Big Data Trends
David Feinleib
 
PPTX
Big Data and Advanced Analytics
McKinsey on Marketing & Sales
 
PPT
Big Data
NGDATA
 
Essential Tools For Your Big Data Arsenal
MongoDB
 
TOUG Big Data Challenge and Impact
Toronto-Oracle-Users-Group
 
Big Data Science: Intro and Benefits
Chandan Rajah
 
Simplifying Big Data ETL with Talend
Edureka!
 
Big data: Bringing competition policy to the digital era – Background note – ...
OECD Directorate for Financial and Enterprise Affairs
 
Talend Big Data Capabilities Overview
Rajan Kanitkar
 
QlikView & Big Data
Mischa van Werkhoven
 
Open Source ETL using Talend Open Studio
santosluis87
 
Big Data Industry Insights 2015
Den Reymer
 
Big Data Trends
David Feinleib
 
Big Data and Advanced Analytics
McKinsey on Marketing & Sales
 
Big Data
NGDATA
 
Ad

Similar to Big Data Hoopla Simplified - TDWI Memphis 2014 (20)

PPTX
Apache hadoop introduction and architecture
Harikrishnan K
 
PDF
BIGDATA ppts
Krisshhna Daasaarii
 
PDF
Introduction To Hadoop Ecosystem
InSemble
 
PPTX
Hadoop and Big Data: Revealed
Sachin Holla
 
PPTX
In15orlesss hadoop
Worapol Alex Pongpech, PhD
 
PPTX
Hadoop An Introduction
Mohanasundaram Ponnusamy
 
PDF
Hadoop Technologies
zahid-mian
 
PPTX
Big data Analytics Hadoop
Mishika Bharadwaj
 
PPTX
BW Tech Meetup: Hadoop and The rise of Big Data
Mindgrub Technologies
 
PPTX
Bw tech hadoop
Mindgrub Technologies
 
PPTX
Hadoop Big Data A big picture
J S Jodha
 
PDF
DBA to Data Scientist
pasalapudi
 
PPTX
Cloudera Hadoop Distribution
Thisara Pramuditha
 
PDF
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 
PPTX
Hadoop - A big data initiative
Mansi Mehra
 
PPTX
Hadoop and their in big data analysis EcoSystem.pptx
Rahul Borate
 
PPTX
Big Data Technology Stack : Nutshell
Khalid Imran
 
PPTX
Big Data and Cloud Computing
Farzad Nozarian
 
PPTX
hadoop-ecosystem-ppt.pptx
raghavanand36
 
PPTX
Brief Introduction about Hadoop and Core Services.
Muthu Natarajan
 
Apache hadoop introduction and architecture
Harikrishnan K
 
BIGDATA ppts
Krisshhna Daasaarii
 
Introduction To Hadoop Ecosystem
InSemble
 
Hadoop and Big Data: Revealed
Sachin Holla
 
In15orlesss hadoop
Worapol Alex Pongpech, PhD
 
Hadoop An Introduction
Mohanasundaram Ponnusamy
 
Hadoop Technologies
zahid-mian
 
Big data Analytics Hadoop
Mishika Bharadwaj
 
BW Tech Meetup: Hadoop and The rise of Big Data
Mindgrub Technologies
 
Bw tech hadoop
Mindgrub Technologies
 
Hadoop Big Data A big picture
J S Jodha
 
DBA to Data Scientist
pasalapudi
 
Cloudera Hadoop Distribution
Thisara Pramuditha
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 
Hadoop - A big data initiative
Mansi Mehra
 
Hadoop and their in big data analysis EcoSystem.pptx
Rahul Borate
 
Big Data Technology Stack : Nutshell
Khalid Imran
 
Big Data and Cloud Computing
Farzad Nozarian
 
hadoop-ecosystem-ppt.pptx
raghavanand36
 
Brief Introduction about Hadoop and Core Services.
Muthu Natarajan
 

Recently uploaded (20)

PDF
An Experience-Based Look at AI Lead Generation Pricing, Features & B2B Results
Thomas albart
 
PDF
Protecting the Digital World Cyber Securit
dnthakkar16
 
PDF
New Download FL Studio Crack Full Version [Latest 2025]
imang66g
 
PDF
Immersive experiences: what Pharo users do!
ESUG
 
PDF
vAdobe Premiere Pro 2025 (v25.2.3.004) Crack Pre-Activated Latest
imang66g
 
PPTX
TRAVEL APIs | WHITE LABEL TRAVEL API | TOP TRAVEL APIs
philipnathen82
 
PDF
ChatPharo: an Open Architecture for Understanding How to Talk Live to LLMs
ESUG
 
PPTX
Maximizing Revenue with Marketo Measure: A Deep Dive into Multi-Touch Attribu...
bbedford2
 
PPTX
Explanation about Structures in C language.pptx
Veeral Rathod
 
PPTX
AI-Ready Handoff: Auto-Summaries & Draft Emails from MQL to Slack in One Flow
bbedford2
 
PPTX
Presentation about variables and constant.pptx
safalsingh810
 
PDF
Exploring AI Agents in Process Industries
amoreira6
 
PDF
Bandai Playdia The Book - David Glotz
BluePanther6
 
PDF
Summary Of Odoo 18.1 to 18.4 : The Way For Odoo 19
CandidRoot Solutions Private Limited
 
PDF
lesson-2-rules-of-netiquette.pdf.bshhsjdj
jasmenrojas249
 
PDF
Key Features to Look for in Arizona App Development Services
Net-Craft.com
 
PPTX
Presentation about Database and Database Administrator
abhishekchauhan86963
 
PPTX
Odoo Integration Services by Candidroot Solutions
CandidRoot Solutions Private Limited
 
PDF
10 posting ideas for community engagement with AI prompts
Pankaj Taneja
 
PPTX
Can You Build Dashboards Using Open Source Visualization Tool.pptx
Varsha Nayak
 
An Experience-Based Look at AI Lead Generation Pricing, Features & B2B Results
Thomas albart
 
Protecting the Digital World Cyber Securit
dnthakkar16
 
New Download FL Studio Crack Full Version [Latest 2025]
imang66g
 
Immersive experiences: what Pharo users do!
ESUG
 
vAdobe Premiere Pro 2025 (v25.2.3.004) Crack Pre-Activated Latest
imang66g
 
TRAVEL APIs | WHITE LABEL TRAVEL API | TOP TRAVEL APIs
philipnathen82
 
ChatPharo: an Open Architecture for Understanding How to Talk Live to LLMs
ESUG
 
Maximizing Revenue with Marketo Measure: A Deep Dive into Multi-Touch Attribu...
bbedford2
 
Explanation about Structures in C language.pptx
Veeral Rathod
 
AI-Ready Handoff: Auto-Summaries & Draft Emails from MQL to Slack in One Flow
bbedford2
 
Presentation about variables and constant.pptx
safalsingh810
 
Exploring AI Agents in Process Industries
amoreira6
 
Bandai Playdia The Book - David Glotz
BluePanther6
 
Summary Of Odoo 18.1 to 18.4 : The Way For Odoo 19
CandidRoot Solutions Private Limited
 
lesson-2-rules-of-netiquette.pdf.bshhsjdj
jasmenrojas249
 
Key Features to Look for in Arizona App Development Services
Net-Craft.com
 
Presentation about Database and Database Administrator
abhishekchauhan86963
 
Odoo Integration Services by Candidroot Solutions
CandidRoot Solutions Private Limited
 
10 posting ideas for community engagement with AI prompts
Pankaj Taneja
 
Can You Build Dashboards Using Open Source Visualization Tool.pptx
Varsha Nayak
 

Big Data Hoopla Simplified - TDWI Memphis 2014

  • 1. 1 Big Data Hoopla Simplified – Hadoop, MapReduce, NoSQL … TDWI Conference – Memphis, TN Oct 29, 2014 © Talend 2014
  • 2. 2 About the Presenter Rajan Kanitkar • Senior Solutions Engineer • Rajan Kanitkar is a Pre-Sales Consultant with Talend. He has been active in the broader Data Integration space for the past 15 years and has experience with several leading software companies in these areas. His areas of specialties at Talend include Data Integration (DI), Big Data (BD), Data Quality (DQ) and Master Data Management (MDM). • Contact: [email protected] © Talend 2014
  • 3. 3 Big Data Ecosystem © Talend 2014
  • 4. 4 Quick Reference – Big Data Hadoop: Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop v1.0 - Original version that focused on HDFS and MapReduce. The Resource Manager and Job Tracker were one entity. Hadoop v2.0 – Sometimes called MapReduce 2 (MRv2). Splits out the Resource Manager and job monitoring into two separate daemons. Also called YARN. This new architecture allows for other processing engines to be managed/monitored aside from just the MapReduce engine. © Talend 2014
  • 5. 5 Quick Reference - Big Data • Hadoop: the core project • HDFS: the Hadoop Distributed File System • MapReduce: the software framework for distributed processing of large data sets • Hive: a data warehouse infrastructure that provides data summarization and a querying language • Pig: a high-level data-flow language and execution framework for parallel computation • HBase: this is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data • And many many more: Sqoop, HCatalog, Zookeeper, Oozie, Cassandra, MongoDB, etc. © Talend 2014
  • 6. 6 Hadoop Core – HDFS Metadata Operations Name Node Client Data Node © Talend 2014 Block Block Block Block Data Node Block Block Block Block Data Node Block Block Block Block Data Node Block Block Block Block Read/Write Control Replicate
  • 7. 7 Hadoop Core – MapReduce © Talend 2014 The „Word Count Example“
  • 8. 8 Quick Reference – Data Services HCatalog: a set of interfaces that open up access to Hive's metastore for tools inside and outside of the Hadoop grid. Hortonworks donated to Apache. In March 2013, merged with Hive. Enables users with different processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the cluster. Hbase: a non-relational, distributed database modeled after Google’s Big Table. Good at storing sparse data. Considered a key-value columnar database. Runs on top of HDFS. Useful for random real-time read/write access. Hive: a data warehouse infrastructure built on top of Hadoop. Provides data summarization, ad-hoc query, and analysis of large datasets. Allows to query data using a SQL-like language called HiveQL (HQL). Mahout: a library of scalable machine-learning algorithms, implemented on top of Hadoop. Mahout supports collaborative filtering, clustering, classification and item set mining. Pig: allows you to write complex MapReduce transformations using a Pig Latin scripting language. Pig Latin defines a set of transformations such as aggregate, join and sort. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop. SQOOP: utility for bulk data import/export between HDFS and structured data stores such as relational databases. © Talend 2014
  • 9. 9 Quick Reference – Operational Services Oozie: Apache workflow scheduler for Hadoop. It allows for coordination between Hadoop jobs. A workflow in Oozie is defined in what is called a Directed Acyclical Graph (DAG). Zookeeper: a distributed, highly available coordination service. Allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (called znodes). Writing distributed applications is hard. It’s hard primarily because of partial failure. ZooKeeper gives you a set of tools to build distributed applications that can safely handle partial failures. Kerberos : a computer network authentication protocol which provides mutual authentication. The name is based on the three- headed dog . The three heads of Kerberos are 1) Key Distribution Center (KDC) 2) the client user 3) the server with the desired service to access. The KDC performs two service functions: Authentication (are you who you say you are) and the Ticket-Granting (gives you an expiring ticket that give you access to certain resources). A Kerberos principal is a unique identity to which Kerberos can assign tickets (like a username). A keytab is a file containing pairs of Kerberos principals and encrypted keys (these are derived from the Kerberos password). © Talend 2014
  • 10. 10 MapReduce 2.0, YARN, Storm, Spark 
• YARN: ensures predictable performance & QoS for all apps 
• Enables apps to run "IN" Hadoop rather than "ON" it 
• Streaming with Apache Storm 
• Mini-batch and in-memory with Apache Spark 
[Diagram: applications run natively IN Hadoop on YARN (cluster resource management) over HDFS2 (redundant, reliable storage) – BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Spark), GRAPH (Giraph), NoSQL (MongoDB), EVENTS (Falcon), ONLINE (HBase), OTHER (Search). Source: Hortonworks] 
© Talend 2014
  • 11. 11 Quick Reference – Hadoop 2.0 Additions 
Storm: a distributed realtime computation system. A Storm cluster is superficially similar to a Hadoop cluster, but where Hadoop runs "MapReduce jobs", Storm runs "topologies". The two are very different: a MapReduce job eventually finishes, while a topology processes messages forever (or until you kill it). Storm can run on top of YARN. 
Spark: a parallel computing engine that can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. It holds intermediate results in memory rather than writing them to disk, which drastically reduces query return time, and it supports more workload types than just MapReduce. 
Tez: a framework that expresses data processing as a complex directed acyclic graph of tasks, built atop Apache Hadoop YARN. Where MapReduce is batch-oriented and unsuited for interactive query, Tez lets Hive and Pig serve interactive queries at petabyte scale. 
© Talend 2014
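The job-vs-topology distinction above can be made concrete with a toy sketch: a batch job consumes a finite input and terminates, while a streaming topology consumes an unbounded source one tuple at a time. This is plain Python for illustration – no Hadoop or Storm API is used, and the spout/bolt analogy is only in the comments:

```python
import itertools

def batch_job(records):
    # A batch (MapReduce-style) job: sees all of its input,
    # produces a result, and finishes.
    return sum(records)

def sensor_stream():
    # An unbounded source, analogous to a Storm "spout".
    n = 0
    while True:
        yield n
        n += 1

def streaming_topology(stream, stop_after):
    # A topology processes tuples one at a time, in principle forever;
    # we stop after a fixed count only so this sketch terminates.
    running_total = 0
    for record in itertools.islice(stream, stop_after):
        running_total += record  # a "bolt" updating state per tuple
    return running_total

print(batch_job([1, 2, 3]))                               # 6 - job completes
print(streaming_topology(sensor_stream(), stop_after=4))  # 0+1+2+3 = 6
```

The key difference is not the arithmetic but the termination contract: the batch function returns because its input ends; the streaming function only returns because we told it when to stop.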
  • 12. 12 Apache Spark 
What is Spark? 
• Spark is an in-memory cluster computing engine that includes an HDFS-compatible in-memory file system. 
Hadoop MapReduce 
• Batch processing at scale 
• Storage: Hadoop HDFS 
• Runs on Hadoop 
VS 
Spark 
• Batch, interactive, graph and real-time processing 
• Storage: Hadoop HDFS, Amazon S3, Cassandra… 
• Runs on many platforms 
• Fast in-memory processing, up to 100x faster than MapReduce (M/R) 
© Talend 2014
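Why does keeping intermediate results in memory matter so much? A toy sketch: an expensive transformation is computed once on the first action and then served from memory, instead of being re-read and recomputed from disk for every query as in classic MapReduce. The `CachedDataset` class is illustrative only – it is not the Spark API, just the caching idea behind it:

```python
class CachedDataset:
    def __init__(self, compute):
        self._compute = compute   # the expensive transformation
        self._cache = None
        self.compute_calls = 0    # instrument how often we actually recompute

    def collect(self):
        if self._cache is None:   # first action materializes the data...
            self.compute_calls += 1
            self._cache = self._compute()
        return self._cache        # ...later actions hit memory instead

squares = CachedDataset(lambda: [x * x for x in range(5)])
first = squares.collect()   # computed
second = squares.collect()  # served from memory, no recompute
# squares.compute_calls == 1; both results == [0, 1, 4, 9, 16]
```

In real Spark the same effect comes from marking an RDD or DataFrame as cached, which is what makes iterative workloads (machine learning, interactive analysis) so much faster than re-running MapReduce passes over disk.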
  • 13. 13 Apache Storm 
What is Storm? 
• Storm is a cluster engine that executes applications performing real-time analysis of streaming data in motion – enabling the Internet of Things for data such as sensor data, aircraft parts data, traffic analysis, etc. 
Storm 
• Real-time stream processing at scale 
• Storage: none – data in motion 
• Runs on Hadoop or on its own cluster 
• Fast in-memory processing 
VS 
Spark 
• Batch, interactive, graph and real-time processing 
• Storage: Hadoop HDFS, Amazon S3, Cassandra… 
• Runs on many platforms 
• Fast in-memory processing 
© Talend 2014
  • 14. 14 Quick Reference – Big Data Vendors 
The Apache Hadoop ecosystem is a collection of many projects. Because of the complexity, for-profit companies have packaged, enhanced, and tried to differentiate themselves in the Hadoop world. The main players are: 
- Cloudera – CDH (Cloudera's Distribution including Apache Hadoop). Current version is CDH 5.2 (includes YARN). 
- Hortonworks – HDP (Hortonworks Data Platform). Spun out of Yahoo in 2011. Current version is HDP 2.2 (YARN). 
- MapR – M3 (Community), M5 (Enterprise), M7 (adds NoSQL). An Apache Hadoop derivative that uses its own NFS-accessible file system instead of HDFS. 
- Pivotal – Pivotal HD (formerly GPHD, Greenplum Hadoop). Spun out of EMC in 2013. Current version is Pivotal HD 2.0 (YARN). 
© Talend 2014
  • 15. 15 Quick Reference – NoSQL 
NoSQL: a NoSQL database provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases – e.g., document, graph, and columnar databases. 
An excellent comparison of NoSQL databases by Kristof Kovacs: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis 
It compares: 
- Cassandra 
- MongoDB 
- Riak 
- Couchbase 
- … and many more 
© Talend 2014
  • 16. 16 Quick Reference – NoSQL 
Document Storage: stores documents that encapsulate and encode data in some standard format (including XML, YAML, and JSON, as well as binary forms like BSON, PDF, and Microsoft Office documents). Different implementations offer different ways of organizing and/or grouping documents. Documents are addressed in the database via a unique key. The big feature is that the database offers an API or query language that allows retrieval of documents based on their contents. 
CouchDB: an Apache database that focuses on embracing the web. Uses JSON to store data, JavaScript as its query language (via MapReduce views), and HTTP for an API. The HTTP API is a differentiator between CouchDB and Couchbase. 
Couchbase: designed to provide key-value or document access, with native JSON support. Membase + CouchDB = Couchbase. The Couchbase architecture adds auto-sharding, memcached-based caching, and 100% uptime redundancy over CouchDB alone. Couchbase has a free version but is not fully open-source. 
MongoDB: stores data as JSON/BSON-style documents with flexible schemas. A "collection" is a grouping of MongoDB documents; collections do not enforce a document structure. 
© Talend 2014
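The document-store model described above – documents addressed by a unique key, no enforced schema, plus content-based queries – can be sketched with a tiny in-memory class. This is an illustration of the model, not the MongoDB or CouchDB API; the class and method names are our own:

```python
class DocumentStore:
    def __init__(self):
        self._docs = {}                     # key -> document (a dict)

    def put(self, key, document):
        self._docs[key] = document          # documents need not share a schema

    def get(self, key):
        return self._docs[key]              # lookup by unique key

    def find(self, predicate):
        # Content-based retrieval: return every document matching a
        # predicate - the "big feature" over a plain key-value store.
        return [d for d in self._docs.values() if predicate(d)]

store = DocumentStore()
store.put("u1", {"name": "Ada", "city": "Memphis"})
store.put("u2", {"name": "Linus", "city": "Portland", "tags": ["os"]})

memphians = store.find(lambda d: d.get("city") == "Memphis")
# memphians == [{"name": "Ada", "city": "Memphis"}]
```

Note that the two documents have different fields ("u2" carries `tags`, "u1" does not) – that schema flexibility is exactly what "collections do not enforce document structures" means in practice.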
  • 17. 17 Quick Reference – NoSQL 
Column Storage: stores data tables as sections of columns rather than as rows. Good for finding or aggregating over large sets of similar data. Column storage serializes all the data for one column contiguously on disk, so reading a single column is very fast. The organization of your data really matters in columnar storage: there is no practical restriction on the number of columns, and one row in a relational model may become many rows in a columnar one. 
Cassandra: an Apache distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. 
DynamoDB: Amazon's NoSQL database service. All data is stored on solid-state drives and replicated across three availability zones. Integrated with Amazon EMR and S3. Stores "Items" (collections of key-value pairs) addressed by an ID. 
Riak: a distributed, fault-tolerant key-value database with an HTTP/REST API. Can walk links (similar to a graph database). Best used for single-site scalability, availability, and fault tolerance – places where even seconds of downtime hurt data collection, such as point-of-sale or factory control systems. 
HBase: a non-relational data store on top of Hadoop. Think of the column as the key and the data as the value. Column families must be created at table-creation time (in Talend, look on the Advanced tab to create families, then use them when writing data). 
© Talend 2014
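Why column storage is good at aggregation can be shown with the same tiny table stored both ways. Summing one column in the row layout touches every field of every record, while the columnar layout reads one contiguous list. A pure-Python sketch of the two layouts (the sample data is made up):

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 175},
]

# Column-oriented layout: each column is stored contiguously,
# mirroring how a columnar store lays the data out on disk.
columns = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales": [100, 250, 175],
}

row_total = sum(r["sales"] for r in rows)  # scans every field of every row
col_total = sum(columns["sales"])          # reads one contiguous column
# both totals == 525
```

In memory the difference is small, but on disk the columnar layout means an aggregation reads only the bytes of the column it needs – which is why "organization of your data really matters" in these stores.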
  • 18. 18 Big Data Integration Landscape © Talend 2014
  • 19. 19 Data-Driven Landscape 
• Hadoop & NoSQL 
• Data Quality 
• Latency & Velocity 
• Expanding Data Volumes 
• Master Data Consistency 
• Lack of Talent / Skills 
• Siloed Data due to SaaS 
• No End-to-End Metadata Visibility 
© Talend 2014
  • 20. 20 Macro Trends Revolutionizing the Integration Market © Talend 2014 20 The amount of data will grow 50X from 2010 to 2020 64% of enterprises surveyed indicate that they’re deploying or planning Big Data projects By 2020, 55% of CIOs will source all their critical apps in the Cloud Source: Gartner and Cisco reports
  • 21. 21 The New Data Integration Economics “Big data is what happened when the cost of keeping information became less than the cost of throwing it away.” – Technology Historian George Dyson © Talend 2014 45x savings. $1,000/TB for Hadoop vs $45,000/TB for traditional $600B revenue shift by 2020 to companies that use big data effectively 6x faster ROI using big data analytics tools vs traditional EDW 600x active data. Neustar moved from storing 1% of data for 60 days to 100% for one year
  • 22. 22 Existing Infrastructures Under Distress: Architecturally and Economically 
[Diagram: a classic data warehouse architecture – relational systems/ERP, legacy systems, and external data sources feed a Transform step into the data warehouse and data marts (with metadata), serving standard reports, ad-hoc query tools, data mining, and MDD/OLAP analytical applications – under pressure from weblogs, the data explosion, the shift from batch to real-time, and the need for more active data.] 
© Talend 2014
  • 23. 23 Benefits of Hadoop and NoSQL 
[Diagram: the same warehouse architecture extended with Hadoop and NoSQL – ERP, DBMS/EDW, legacy systems, IoT, NoSQL stores, and web logs feeding the data marts and the same standard reports, ad-hoc query tools, data mining, and MDD/OLAP analytical applications – now absorbing the data explosion, the batch-to-real-time shift, and longer active data retention.] 
© Talend 2014
  • 24. 24 Top Big Data Challenges © Talend 2014 Source: Gartner - Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind the Hype - 12 September 2013 - G00255160 “How To” Challenges
  • 25. 25 Big Data Integration Capabilities © Talend 2014
  • 26. 26 Top Big Data Challenges © Talend 2014 Need Solutions that Address these Challenges Source: Gartner - Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind the Hype - 12 September 2013 - G00255160
  • 27. 27 Convergence, Big Data & Consumerization 
• Next-gen integration platforms need to be designed & architected with big data requirements in mind 
• Processing (ETL / ELT parallelization) needs to be distributed & flexible 
• Big data technologies need to be integrated seamlessly with existing integration investments (RDBMS, etc.) 
© Talend 2014
  • 28. 28 Big Data Integration Landscape © Talend 2014
  • 29. 29 "I may say that this is the greatest factor: the way in which the expedition is equipped." 
– Roald Amundsen, race to the South Pole, 1911 
Source of Roald Amundsen portrait: Norwegian National Library 
© Talend 2014
  • 30. 30 Big Data Integration: Ingest – Transform – Deliver 
[Diagram: INGEST (ingestion) via Sqoop, Flume, the HDFS API, and the HBase API; TRANSFORM (data refinement) with map, profile, parse, cleanse, CDC, standardize, machine learning, and match steps – running on YARN (cluster resource management) and HDFS2 (redundant, reliable storage) alongside Hive, BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Spark), GRAPH (Giraph), NoSQL (MongoDB), EVENTS (Falcon), ONLINE (HBase), and OTHER (Search); DELIVER (as an API) via Karaf, ActiveMQ, CXF, Camel, Kafka, and Storm – all under iPaaS, MDM, HA, governance, security, and metadata services.] 
© Talend 2014
  • 31. 31 Big Data Integration and Processing 
[Diagram: data from various source systems is loaded to HDFS by the big data integration layer, processed in Hadoop with MapReduce and Hive, and federated out to an analytics dashboard.] 
© Talend 2014
  • 32. 32 Important Objectives 
• Moving from hand-coding to code generation – MapReduce, Pig, Hive, Sqoop, etc. – using a graphical user interface 
• Zero footprint on the Hadoop cluster 
• Same graphical user interface for both standard data integration and Big Data integration 
© Talend 2014
  • 33. 33 Trying to get from this… © Talend 2014
  • 34. 34 …to this: "pure Hadoop" and MapReduce 
Visual design in MapReduce, optimized before deploying on Hadoop. 
© Talend 2014
  • 35. 35 Native Map/Reduce Jobs • Create graphical ETL patterns using native Map/Reduce © Talend 2014 • Reduce the need for big data coding skills • Zero pre-installation on the Hadoop cluster • Hadoop is the “engine” for data processing
  • 36. 36 Other Important Objectives 
• Enables organizations to leverage existing skills such as Java and other open source languages 
• A large collaborative community for support 
• A large number of components for data and applications, including big data and NoSQL 
• Works directly on the Apache Hadoop API 
• Native support for YARN and Hadoop 2.0 for better resource optimization 
• Software created through open standards and development processes, eliminating vendor lock-in 
• Scalability, portability and performance come for "free" thanks to Hadoop 
© Talend 2014
  • 37. 37 © Talend 2014 Talend Solution for Big Data Integration
  • 38. 38 Talend’s Solution © Talend 2014
  • 39. 39 The Value of Talend for Big Data Leverage In-house Resources © Talend 2014 - Easy-to-use familiar Eclipse-tools that generate big data code - 100% standards-based, open source - Lots of examples with a large collaborative community Big Data Ready - Native support for Hadoop, MapReduce, and NoSQL - 800+ connectors to all data sources - Built-in data quality, security and governance (Platform for Big Data) Lower Costs - A predictable and scalable subscription model - Based only on users (not CPUs or connectors) - Free to download, no runtimes to install on your cluster $
  • 40. 40 Talend’s Value for Big Data • New frameworks like Spark and Storm are emerging on Hadoop and can run on other platforms • Companies want to accelerate big data processing and do more sophisticated workloads by exploiting in-memory capabilities via Spark and for analyzing real-time data in motion via Storm • Talend can generate Storm applications to analyze and filter data in real-time as well as use source data filtered by Storm applications • Talend can help customers rapidly exploit new Big Data technologies to reduce time to value while insulating them from future extensions and advancements © Talend 2014
  • 41. 41 Thank You For Your Participation © Talend 2014