The Platform for Big Data
    Amr Awadallah | CTO, Founder, Cloudera, Inc.
    aaa@cloudera.com, twitter: @awadallah




1
The Problems with Current Data Systems

    Data flow, bottom to top:
    •   Instrumentation
    •   Collection
    •   Storage Only Grid (original raw data), mostly append
    •   ETL Compute Grid
    •   RDBMS (aggregated data)
    •   BI Reports + Interactive Apps

    Problems:
    1. Moving Data To Compute Doesn’t Scale
    2. Archiving = Premature Data Death
    3. Can’t Explore Original High Fidelity Raw Data


2                          ©2012 Cloudera, Inc. All Rights Reserved.
The Solution: A Combined Storage/Compute Layer

    Data flow, bottom to top:
    •   Instrumentation
    •   Collection
    •   Hadoop: Storage + Compute Grid, mostly append
    •   RDBMS (aggregated data)
    •   BI Reports + Interactive Apps

    Benefits:
    1. Scalable Throughput For ETL & Aggregation (ETL Acceleration)
    2. Keep Data Alive Forever (Active Archive)
    3. Data Exploration & Advanced Analytics


3
So What is Apache Hadoop?
•   A scalable fault-tolerant distributed system for data storage and
    processing (open source under the Apache license).

•   Core Hadoop has two main systems:
    • Hadoop Distributed File System: self-healing high-bandwidth clustered
      storage.
    • MapReduce: distributed fault-tolerant resource management and scheduling
      coupled with a scalable data programming abstraction.

•   Key business values:
    • Flexibility – Store any data, Run any analysis.
    • Scalability – Start at 1 TB/3 nodes, grow to petabytes/1000s of nodes.
    • Economics – Cost per TB at a fraction of traditional options.
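
The MapReduce programming abstraction mentioned above can be illustrated with a small single-process sketch. This is not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. It shows the three steps a MapReduce job goes through: mapping each record to key/value pairs, grouping values by key, and reducing each group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as a framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in a single process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["store any data", "run any analysis"]))
```

In a real cluster the map and reduce functions would run in parallel on many nodes, with the framework handling scheduling and failures; the developer-facing logic stays roughly this small.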


4
The Hadoop Big Bang

    •   Fastest sort of a TB: 62 seconds over 1,460 nodes
    •   Sorted a PB in 16.25 hours over 3,658 nodes
    •   Hadoop World 2009: 500 attendees


5
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
    •   Schema must be created before any data can be loaded.
    •   An explicit load operation has to take place which transforms data
        to DB internal serialization format.
    •   New columns must be added explicitly before new data for such
        columns can be loaded into the database.
    Pros:
    •   OLAP is Fast
    •   Standards/Governance

Schema-on-Read (Hadoop):
    •   Data is simply copied to the file store, no transformation is needed.
    •   A SerDe (Serializer/Deserializer) is applied during read time to extract
        the required columns (late binding).
    •   New data can start flowing anytime and will appear retroactively once the
        SerDe is updated to parse it.
    Pros:
    •   Load is Fast
    •   Flexibility/Agility
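
A minimal sketch of the schema-on-read idea, assuming raw records are stored as JSON lines. The names `RAW_STORE` and `read_with_schema` and the sample records are hypothetical; a real SerDe in Hive is a Java class, which this Python function only imitates. The point it illustrates is late binding: the stored data never changes, and widening the read-time schema exposes a column for records that were already stored.

```python
import json
from typing import Dict, Iterator, List, Optional

# Raw records are kept exactly as they arrived; nothing is transformed on load.
RAW_STORE: List[str] = [
    '{"user": "u1", "url": "/home"}',
    '{"user": "u2", "url": "/cart", "referrer": "/home"}',
]


def read_with_schema(
    raw_lines: List[str], columns: List[str]
) -> Iterator[Dict[str, Optional[str]]]:
    """Deserialize each raw line at read time and project the requested columns."""
    for line in raw_lines:
        record = json.loads(line)
        # Columns missing from a record come back as None instead of failing.
        yield {col: record.get(col) for col in columns}


# The first read-time schema knows about two columns.
v1 = list(read_with_schema(RAW_STORE, ["user", "url"]))

# Adding a column to the read-time schema exposes it for data already stored.
v2 = list(read_with_schema(RAW_STORE, ["user", "url", "referrer"]))
```

With schema-on-write, the `referrer` column would have needed to exist in the table before the second record could be loaded.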


6
Scalability: Scalable Software Development

      Grows without requiring developers to
      re-architect their algorithms/application.


                  AUTO SCALE




7
Economics: Return on Byte
    •   Return on Byte (ROB) = value to be extracted from that
        byte divided by the cost of storing that byte.

    •   If ROB is < 1, the data gets buried in the tape wasteland,
        so we need more economical active storage.

        (Chart labels: High ROB, Low ROB)
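
The ROB definition above is a simple ratio, sketched here in Python. The function names and the numbers in the usage lines are hypothetical and only illustrate how the same data can move from below 1 to above 1 when storage gets cheaper.

```python
def return_on_byte(value_extracted: float, storage_cost: float) -> float:
    """ROB = value extracted from the data divided by the cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


def worth_keeping_active(value_extracted: float, storage_cost: float) -> bool:
    """Data with ROB below 1 costs more to keep online than it returns."""
    return return_on_byte(value_extracted, storage_cost) >= 1.0


# Hypothetical figures: the same data set on expensive and on cheaper storage.
rob_expensive = return_on_byte(value_extracted=50.0, storage_cost=200.0)
rob_cheap = return_on_byte(value_extracted=50.0, storage_cost=25.0)
```

Lowering the storage cost is the only lever shown here; the value side of the ratio depends on the analysis run against the data.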



8
The Big Data Platform: CDH4 – June 2012

    •   Build/Test: Apache Bigtop
    •   Job Workflow: Apache Oozie
    •   Data Processing Lib: DataFu for Pig
    •   Data Mining Lib: Apache Mahout
    •   Web Console: Hue
    •   Interactive SQL: Impala
    •   Metadata: Apache Hive MetaStore
    •   Batch Processing Languages: Apache Pig, Apache Hive
    •   Data Integration: Apache Flume, Apache Sqoop
    •   Hadoop Core Kernel: MapReduce, HDFS
    •   Fast Read/Write Access: Apache HBase
    •   Cloud Deployment: Apache Whirr
    •   Connectivity: ODBC/JDBC/FUSE/HTTPS
    •   Coordination: Apache ZooKeeper
    •   Cloudera Manager Free Edition (Installation Wizard)


9
CDH in the Enterprise Data Stack
    Users and their tools:
    •   Engineers: IDEs
    •   Data Scientists: Modeling Tools
    •   Analysts: BI / Analytics
    •   Business Users: Enterprise Reporting
    •   Data Architects: Meta Data/ETL Tools
    •   System Operators: Cloudera Manager

    Connectivity: ODBC, JDBC, NFS, HTTP

    Connected systems:
    •   Enterprise Data Warehouse (via Sqoop)
    •   Online Serving Systems (via Sqoop)
    •   Web/Mobile Applications
    •   Customers

    Data sources:
    •   Logs (via Flume)
    •   Files (via Flume)
    •   Web Data (via Flume)
    •   Relational Databases (via Sqoop)


10
HBase versus HDFS
HDFS:                                                     HBase:
Optimized For:                                            Optimized For:
•    Large Files                                          •   Small Records
•    Sequential Access (High Throughput)                  •   Random Access (Low Latency)
•    Append Only                                          •   Atomic Record Updates

Use For:                                                  Use For:
•    Fact tables that are mostly append only              •   Dimension tables which are updated
     and require sequential full table scans.                 frequently and require random
                                                              low-latency lookups.

                                                          Not Suitable For:
                                                          •   Low Latency Interactive OLAP.
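
The two access patterns contrasted above can be sketched with in-memory stand-ins. This is an analogy only, not the HDFS or HBase client API: `AppendOnlyLog` and `KeyedTable` are hypothetical classes that mimic an append-only fact table read by full scan and a keyed dimension table updated and looked up by row key.

```python
from typing import Dict, List, Optional, Tuple


class AppendOnlyLog:
    """Stand-in for an append-only fact table read by sequential full scans."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, float]] = []

    def append(self, customer_id: str, amount: float) -> None:
        # Rows are only ever added at the end; existing rows are never changed.
        self._rows.append((customer_id, amount))

    def total(self) -> float:
        # Answering a question means scanning every row in order.
        return sum(amount for _, amount in self._rows)


class KeyedTable:
    """Stand-in for a dimension table with per-record updates and lookups."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def upsert(self, key: str, **attributes: str) -> None:
        # A single record is updated in place, addressed by its key.
        self._rows.setdefault(key, {}).update(attributes)

    def lookup(self, key: str) -> Optional[Dict[str, str]]:
        # A single record is fetched by key without scanning the others.
        return self._rows.get(key)


if __name__ == "__main__":
    sales = AppendOnlyLog()
    sales.append("c1", 10.0)
    sales.append("c2", 5.5)
    customers = KeyedTable()
    customers.upsert("c1", city="NYC")
    customers.upsert("c1", city="Boston")
    print(sales.total(), customers.lookup("c1"))
```

The split mirrors the slide's guidance: large, rarely modified data suits the scan-oriented store, while small, frequently updated records suit the keyed store.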


11
Use Case Examples

 •   Retail: Price Optimization
 •   Media: Content Targeting
 •   Finance: Fraud Detection
 •   Manufacturing: Diagnostics
 •   Info Services: Satellite Imagery
 •   Agriculture: Seed Optimization
 •   Power: Smart Consumption

12
Core Benefits of the Platform for Big Data

        1. FLEXIBILITY
        STORE ANY DATA
        RUN ANY ANALYSIS
        KEEPS PACE WITH THE RATE OF CHANGE OF INCOMING DATA



        2. SCALABILITY
        PROVEN GROWTH TO PBS/1,000s OF NODES
        NO NEED TO REWRITE QUERIES, AUTOMATICALLY SCALES
        KEEPS PACE WITH THE RATE OF GROWTH OF INCOMING DATA



        3. ECONOMICS
        COST PER TB AT A FRACTION OF OTHER OPTIONS
        KEEP ALL OF YOUR DATA ALIVE IN AN ACTIVE ARCHIVE
        POWERING THE DATA BEATS ALGORITHM MOVEMENT



13
Thank you!
Amr Awadallah, CTO, Founder, Cloudera, Inc. <aaa@cloudera.com>   @awadallah

Data Science Day New York: The Platform for Big Data

Editor's Notes

  • #10:
    Open Source – 100% Open Source, 100% Apache licensed, 100% Free.
    Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA.
    Proven at Scale – Deployed at hundreds of enterprises across many industries.
    Integrated – All required component versions & dependencies are properly managed.
    Industry Standard – Existing RDBMS, ETL and BI systems work with it.
    Many Form Factors – Public Cloud, Private Cloud, RHEL, Ubuntu, 32/64bit, etc.