Hadoop and the Data Warehouse
    Patrick Angeles




1
About Me

    •   Director of Field Engineering at Cloudera
         •   Architect on several dozen Hadoop-based data solutions
             for Cloudera customers
    •   Started with Hadoop in 2008
         •   First Hadoop system processed set-top box log data
    •   Past life
         •   Java EE / Database Architect
         •   Web Data Mining
         •   Cryptography / Public Key Infrastructure



2
What is a Data Warehouse?




3
— The Oracle



4
Database Architecture 1.0




       Products
                                Inventory
       Customers       DB
                                Sales
       Orders




5
Database Architecture 1.0

     •   Dead simple
     •   Tables in 3rd normal form
     •   Reports are SQL queries that join through entity
         relationships and aggregate

                  SELECT   c.gender, p.product_name,
                           SUM(o.qty), SUM(o.price)
                  FROM     orders o, customer c, product p
                  WHERE    o.customer_id = c.id
                   AND     o.product_id = p.id
                   AND     o.day = '2013-03-21'
                  GROUP BY c.gender, p.product_name;


6
Database Architecture 1.0

     •   Report queries can become expensive and redundant
     •   Build a layer of abstraction!
     •   Materialize the data into something closer to query
         form
     •   Create reporting tables
          •   Decide on the report's columns
          •   Decide which query criteria can be parameterized
          •   Decide how often the report is generated
          •   Denormalize and aggregate
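
The reporting-table idea can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the deck's own implementation: the schema, the sample rows, and the `daily_sales_report` table name are assumptions, and the order table is called `orders` because `ORDER` is a reserved word in SQL.

```python
import sqlite3


def build_daily_sales_report(conn):
    """Materialize a denormalized, pre-aggregated reporting table.

    `daily_sales_report` is a hypothetical name chosen for this sketch.
    """
    conn.executescript("""
        DROP TABLE IF EXISTS daily_sales_report;
        CREATE TABLE daily_sales_report AS
        SELECT   o.day, c.gender, p.product_name,
                 SUM(o.qty)   AS total_qty,
                 SUM(o.price) AS total_price
        FROM     orders o
        JOIN     customer c ON o.customer_id = c.id
        JOIN     product  p ON o.product_id  = p.id
        GROUP BY o.day, c.gender, p.product_name;
    """)


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed normalized schema with a handful of sample rows.
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
        CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
        CREATE TABLE orders   (customer_id INTEGER, product_id INTEGER,
                               day TEXT, qty INTEGER, price REAL);
        INSERT INTO customer VALUES (1, 'F'), (2, 'M');
        INSERT INTO product  VALUES (10, 'widget');
        INSERT INTO orders   VALUES (1, 10, '2013-03-21', 2, 20.0),
                                    (1, 10, '2013-03-21', 1, 10.0),
                                    (2, 10, '2013-03-21', 5, 50.0),
                                    (2, 10, '2013-03-22', 1, 10.0);
    """)
    build_daily_sales_report(conn)
    # The report is now a single-table lookup, parameterized by day.
    return conn.execute(
        "SELECT gender, product_name, total_qty, total_price "
        "FROM daily_sales_report WHERE day = ? ORDER BY gender",
        ("2013-03-21",),
    ).fetchall()
```

Calling `demo()` returns one row per gender and product for the requested day. The joins and aggregation run once, when the table is built, rather than on every report request; how often to rebuild it is the periodicity decision listed above.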

7
Database Architecture 1.1




                               Inventory
               Customers
                                      Sales
                      Orders
           Products




8
Two Database Workloads

           Transactional     Analytic
              Record facts   Reveal patterns

          Write-optimized    Read-optimized

      Random reads/writes    Sequential reads

       Normalized schema     Denormalized schema



9
Analytical Database (2.0)




              Customers          Inventory

                     Orders             Sales
          Products




10
Analytical Database Architecture

      •   Column oriented storage
           •   Reduces I/O on multi-dimensional tables
           •   Improved compression
           •   Skip columns or row ranges
      •   Massively Parallel Processing
           •   Query planner breaks up a task to be executed on
               multiple hosts
      •   Shared-nothing Architecture
           •   Cluster nodes have independent storage and memory
      •   Slow writes, fast reads
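
A toy sketch of why column-oriented storage helps analytic queries, assuming a small in-memory table. The helper names (`to_columns`, `run_length_encode`, `sum_column`) and the sample data are made up for illustration; real column stores use far more elaborate encodings.

```python
from itertools import groupby

# Assumed sample data: (day, gender, qty) rows, as a row store would hold them.
ROWS = [
    ("2013-03-21", "F", 2),
    ("2013-03-21", "F", 1),
    ("2013-03-21", "M", 5),
    ("2013-03-22", "M", 1),
]
COLUMN_NAMES = ("day", "gender", "qty")


def to_columns(rows, names):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Compress runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def sum_column(columns, name):
    """Aggregate a single column; the other columns are never read."""
    return sum(columns[name])


COLUMNS = to_columns(ROWS, COLUMN_NAMES)
```

With this layout, `sum_column(COLUMNS, "qty")` reads only the `qty` list, and `run_length_encode(COLUMNS["day"])` collapses the repeated dates into two pairs. Values within a column tend to repeat, which is one reason column stores tend to compress well.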

11
Analytical Database




                    TX     Analytical
                    DB        DB




12
Data Transformation




                   TX      Analytical
                   DB         DB




13
Three Ways to Transform Data

      •   Transform Extract Load
           •   Query from transactional tables into target schema
      •   Extract Load Transform
           •   Load data into analytical database, transform and write
               to target schema
           •   No need for additional hardware
      •   Extract Transform Load
           •   Read data from transactional database into a grid
               system, transform, then write to analytical database
           •   Least load on tx and analytical systems
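
The three approaches differ mainly in where the transformation runs. The sketch below uses two in-memory SQLite databases to stand in for the transactional and analytical systems, and plain Python to stand in for the grid. The table names (`orders`, `raw_orders`, `daily_qty`) and sample rows are assumptions for illustration only.

```python
import sqlite3
from collections import defaultdict


def make_tx_db():
    """Stand-in transactional database with a few sample orders."""
    tx = sqlite3.connect(":memory:")
    tx.executescript("""
        CREATE TABLE orders (day TEXT, product TEXT, qty INTEGER);
        INSERT INTO orders VALUES ('2013-03-21', 'widget', 2),
                                  ('2013-03-21', 'widget', 3),
                                  ('2013-03-21', 'gadget', 1);
    """)
    return tx


def load_target(dw, rows):
    """Write already-transformed rows into the target schema."""
    dw.execute("CREATE TABLE daily_qty (day TEXT, product TEXT, qty INTEGER)")
    dw.executemany("INSERT INTO daily_qty VALUES (?, ?, ?)", rows)


def tel(tx, dw):
    """Transform-Extract-Load: the transactional database does the aggregation."""
    rows = tx.execute(
        "SELECT day, product, SUM(qty) FROM orders GROUP BY day, product"
    ).fetchall()
    load_target(dw, rows)


def elt(tx, dw):
    """Extract-Load-Transform: the analytical database does the aggregation."""
    raw = tx.execute("SELECT day, product, qty FROM orders").fetchall()
    dw.execute("CREATE TABLE raw_orders (day TEXT, product TEXT, qty INTEGER)")
    dw.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw)
    dw.execute("""CREATE TABLE daily_qty AS
                  SELECT day, product, SUM(qty) AS qty
                  FROM raw_orders GROUP BY day, product""")


def etl(tx, dw):
    """Extract-Transform-Load: aggregation runs outside both databases."""
    totals = defaultdict(int)
    for day, product, qty in tx.execute("SELECT day, product, qty FROM orders"):
        totals[(day, product)] += qty
    load_target(dw, [(d, p, q) for (d, p), q in totals.items()])


def run(strategy):
    """Apply one strategy to fresh databases and return the target rows."""
    tx, dw = make_tx_db(), sqlite3.connect(":memory:")
    strategy(tx, dw)
    return dw.execute(
        "SELECT day, product, qty FROM daily_qty ORDER BY product"
    ).fetchall()
```

All three strategies produce the same `daily_qty` rows here; `run(tel)`, `run(elt)`, and `run(etl)` differ only in which system pays for the `GROUP BY`.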

14
Business Intelligence Tools




             TX          Analytical
                                      BI
             DB             DB




15
Business Intelligence Tools

      • Can provide canned reports, dashboards, or
        interactive visualizations
      • Typically leverage common standards (SQL,
        JDBC/ODBC) to access data
      • Require low-latency (sub-second or sub-minute,
        depending on query) response times from database




16
Observations

      • Separate transactional from analytical workloads
      • Use appropriate database implementation
        according to the workload
          •   ‘Traditional’ row-major store for transactional
          •   MPP column-store for analytic
      • Consider a BI tool so you’re not stuck writing
        reports for analysts who don’t know SQL
      • Consider an ETL tool so you’re not stuck writing
        transformations for analysts who don’t know SQL


17
Welcome to the Enterprise




18
Basic Data Warehouse Architecture




             TX                   BI
                        DW
             DB




19
Data Marts


                       Sales




           TX          Mktg    BI
                  DW
           DB




                       Prch




20
Multiple Data Sources

          TX
          DB                  Sales




         Files           DW   Mktg    BI




         other                Prch




21
Operational Data Store

       TX
       DB                          Sales




      Files                        Mktg    BI
                ODS           DW




      other                        Prch




22
Where’s Hadoop?




23
No Hadoop

      TX
      DB                    Sales




      Files                 Mktg    BI
                 ODS   DW




     other                  Prch




24
Adjacent System

       TX
       DB                   Sales




      Files                 Mktg    BI
                       DW



                ODS
      other                 Prch




25
ETL Engine

       TX
       DB              Sales




      Files            Mktg    BI
                  DW




      other            Prch




26
Tiered Data Warehouse

             TX
             DB              Sales




            Files            Mktg    BI




            other            Prch




27
Analytical Query Engine

               TX
               DB




              Files            BI




              other




28
Simple Database Architecture




        Products
                                    Inventory
        Customers       DB          Sales
        Orders




29
The future?




        Products
                    Inventory
        Customers
                    Sales
        Orders




30
http://www.hbasecon.com/
            San Francisco
            June 13, 2013




31
32

Editor's Notes

  • #3: Architected scores of Hadoop-based data solutions
  • #6: Doesn’t scale. Limited storage. Concurrent writes / queries. What if I want different reports?
  • #9: Turns out separating the transactional vs reporting database brings other benefits
  • #11: I don’t need up-to-the-minute reports. Copy data to reporting DB. Now workloads don’t conflict. I can now have a different reporting schema. Faster queries. Now I have to worry about transforming data. I can now use different technology.
  • #13: 2 other major components that haven’t been mentioned
  • #14: I don’t need up-to-the-minute reports. Copy data to reporting DB. Now workloads don’t conflict. I can now have a different reporting schema. Faster queries. Now I have to worry about transforming data. I can now use different technology.
  • #15: Not a trivial thing… there’s an X-billion-dollar market segment dedicated to making this easier. Informatica, Pervasive, Ab Initio, Pentaho. Speaking of making things easier…
  • #16: Two things this allows you to do: use different underlying architectures for each database
  • #17: Not a trivial thing… there’s an X-billion-dollar market segment dedicated to making this easier. Informatica, Pervasive, Ab Initio, Pentaho. Speaking of making things easier…
  • #20: Two things this allows you to do: use different underlying architectures for each database
  • #21: Data marts designed for specific department needs. Kimball?
  • #22: Two things this allows you to do: use different underlying architectures for each database
  • #23: Ralph Kimball – The Data Warehouse Toolkit. Bill Inmon – Building the Data Warehouse.
  • #26: Challenge with normal grid-based ETL is you have to load data from source systems. Hadoop’s cost-efficient storage allows enterprises to store source data in Hadoop, thereby replacing the ETL grid. You could also forgo the ODS if there is one in the architecture. Option to enrich data that is published to the DW by running analytics not available to traditional DW/BI stack. E.g., clustering, classification, statistical
  • #27: Challenge with normal grid-based ETL is you have to load data from source systems. Hadoop’s cost-efficient storage allows enterprises to store source data in Hadoop, thereby replacing the ETL grid. You could also forgo the ODS if there is one in the architecture. Option to enrich data that is published to the DW by running analytics not available to traditional DW/BI stack. E.g., clustering, classification, statistical
  • #28: Store long-term data. Transform and load to data marts.
  • #29: Store long-term data. BI tools can readily query data in Hadoop using Impala.
  • #30: Doesn’t scale. Limited storage. Concurrent writes / queries. What if I want different reports?
  • #31: Support for insert/update semantics? HBase with typed columns.