Hw09 Rethinking The Data Warehouse With Hadoop And Hive

  • 1. Rethinking Data Warehousing & Analytics. Ashish Thusoo, Facebook Data Infrastructure Team.
  • 2. Why Another Data Warehousing System? Data, data, and more data: 200 GB per day in March 2008; 2+ TB (compressed) of raw data per day in April 2009; 4+ TB (compressed) of raw data per day today.
  • 3.  
  • 4. Trends Leading to More Data: free or low-cost user services; the realization that more insights are derived from simple algorithms on more data.
  • 5. Deficiencies of Existing Technologies: the cost of analysis and storage on proprietary systems does not support the trend towards more data; closed and proprietary systems; limited scalability does not support the trend towards more data.
  • 6. Hadoop Advantages. Pros: superior in availability/scalability/manageability despite lower single-node performance; open system; scalable costs. Cons (programmability and metadata): map-reduce is hard to program (users know SQL/bash/Python/Perl); need to publish data in well-known schemas. Solution: HIVE.
  • 7. What is HIVE? A system for managing and querying structured data, built on top of Hadoop. Components: Map-Reduce for execution; HDFS for storage; metadata in an RDBMS.
  • 8. Hive: Simplifying Hadoop – New Technology, Familiar Interfaces
       hive> select key, count(1) from kv1 where key > 100 group by key;
       vs.
       $ cat > /tmp/reducer.sh
       uniq -c | awk '{print $2"\t"$1}'
       $ cat > /tmp/map.sh
       awk -F '\001' '{if($1 > 100) print $1}'
       $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
       $ bin/hadoop dfs -cat /tmp/largekey/part*
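The comparison on slide 8 can be made concrete with a small local simulation. The sketch below is not part of the talk: it mimics, in plain Python, what `map.sh` and `reducer.sh` compute, with `sorted()` standing in for the sort Hadoop performs between the map and reduce phases. The function names and the sample rows are hypothetical, and the code assumes every key in `kv1` parses as an integer.

```python
from itertools import groupby

FIELD_SEP = "\x01"  # Ctrl-A field delimiter, matching the awk -F '\001' call on the slide


def map_phase(lines):
    """Emit the key of every row whose key is greater than 100 (mirrors map.sh)."""
    for line in lines:
        key = line.rstrip("\n").split(FIELD_SEP)[0]
        if int(key) > 100:  # assumption: keys are integers
            yield key


def reduce_phase(sorted_keys):
    """Count runs of identical keys (mirrors `uniq -c` plus the awk column swap)."""
    for key, group in groupby(sorted_keys):
        yield f"{key}\t{sum(1 for _ in group)}"


def run_job(lines):
    """Run map, sort, and reduce in sequence on an in-memory list of rows."""
    # sorted() stands in for the shuffle/sort step between mappers and the reducer.
    return list(reduce_phase(sorted(map_phase(lines))))


if __name__ == "__main__":
    # Hypothetical sample rows in key<Ctrl-A>value form.
    sample = ["238\x01val_238", "86\x01val_86", "311\x01val_311", "238\x01val_238"]
    for row in run_job(sample):
        print(row)
```

Even in this toy form, the filter, the sort, and the count are three separate pieces that the single `select ... group by` statement expresses in one line.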
  • 9. Hive: Open and Extensible. Query your own formats and types with your own Serializer/Deserializers; extend the SQL functionality through User Defined Functions; do any non-SQL transformations through the TRANSFORM operator, which sends data from Hive to any user program/script.
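As an illustration of the TRANSFORM operator mentioned on slide 9, here is a minimal sketch of a user script. It relies on the operator streaming rows to the script's standard input as tab-separated text and reading tab-separated rows back from its standard output. The script name `upper_value.py`, the upper-casing logic, and the two-column layout are assumptions chosen for the example, not something shown in the talk.

```python
import sys

# Hypothetical invocation from Hive (script name and column names are assumptions):
#   ADD FILE upper_value.py;
#   SELECT TRANSFORM(key, value) USING 'python upper_value.py' AS (key, value) FROM kv1;


def transform_row(line):
    """Upper-case the second column of one tab-separated row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 1:
        fields[1] = fields[1].upper()
    return "\t".join(fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write one transformed row per input row."""
    for line in stdin:
        stdout.write(transform_row(line) + "\n")


if __name__ == "__main__":
    main()
```

Because the script only deals with lines of text, it can be tested locally by piping a tab-separated file through it before it is ever attached to a query.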
  • 10. Hive: Smart Execution Plans for Performance. Hash-based aggregations; map-side joins; predicate pushdown; partition pruning; many more to come in the future.
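Three of the optimizations named on slide 10 can be sketched on in-memory data. This is a conceptual illustration only, not Hive's implementation: the function names, the `ds` partition key, and the row layouts are all hypothetical.

```python
def prune_partitions(partitions, wanted_ds):
    """Partition pruning: keep only partitions whose key matches the filter,
    so rows in the other partitions are never read."""
    return {ds: rows for ds, rows in partitions.items() if ds == wanted_ds}


def map_side_join(big_rows, small_rows):
    """Map-side join: build an in-memory hash table from the small table and
    probe it while streaming the big table, so no shuffle/reduce step is needed."""
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    for key, big_value in big_rows:
        for small_value in lookup.get(key, []):
            yield key, big_value, small_value


def hash_aggregate(rows):
    """Hash-based aggregation: keep a running count per key in a dict
    instead of sorting all rows before counting."""
    counts = {}
    for key, _ in rows:
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical data: events partitioned by date, joined against a small user table.
    events = {
        "2009-10-01": [(1, "click"), (2, "view"), (1, "view")],
        "2009-10-02": [(3, "click")],
    }
    users = [(1, "US"), (2, "CA")]
    day = prune_partitions(events, "2009-10-01")["2009-10-01"]
    print(list(map_side_join(day, users)))
    print(hash_aggregate(day))
```

The map-side join only pays off when one side is small enough to hold in memory on every mapper, which is why the sketch names its inputs `big_rows` and `small_rows`.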
  • 11. Interoperability: JDBC and ODBC interfaces available; integrations with some traditional SQL tools with some minor modifications; more improvements in the future to support interoperability with existing front-end tools.
  • 12. Information: available as a subproject in Hadoop. http://wiki.apache.org/hadoop/Hive (wiki); http://hadoop.apache.org/hive (home page); http://svn.apache.org/repos/asf/hadoop/hive (SVN repo); ##hive (IRC). Works with hadoop-0.17, 0.18, 0.19, 0.20. Release 0.4.0 is coming in the next few days. Mailing lists: hive-{user,dev,commits}@hadoop.apache.org
  • 13. Data Warehousing @ Facebook using Hive & Hadoop
  • 14. Data Flow Architecture at Facebook (diagram labels): Web Servers; Scribe MidTier; Filers; Production Hive-Hadoop Cluster; Oracle RAC; Federated MySQL; Scribe-Hadoop Cluster; Adhoc Hive-Hadoop Cluster; Hive replication.
  • 15. Looks like this (diagram labels): six nodes, each with its own disks; links labeled 1 Gigabit and 4 Gigabit; Node = DataNode + Map-Reduce.
  • 16. Hadoop & Hive Cluster @ Facebook. Hadoop/Hive Warehouse – the new generation: 4800 cores; storage capacity of 5.5 petabytes; 12 TB per node. Two-level network topology: 1 Gbit/sec from node to rack switch; 4 Gbit/sec to top-level rack switch.
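The cluster figures on slide 16 invite some back-of-envelope arithmetic. The sketch below is not from the talk. It assumes decimal units (1 PB = 1000 TB) and a homogeneous cluster, neither of which the slide states, and it treats the number of nodes per rack as a hypothetical parameter because the slide does not give it.

```python
TB_PER_PB = 1000  # assumption: decimal units; the slide does not say


def estimated_nodes(capacity_pb, tb_per_node):
    """Rough node count implied by total capacity and per-node storage."""
    return capacity_pb * TB_PER_PB / tb_per_node


def oversubscription(nodes_per_rack, node_gbit=1.0, uplink_gbit=4.0):
    """Ratio of the bandwidth nodes in one rack could offer to the rack's uplink."""
    return nodes_per_rack * node_gbit / uplink_gbit


if __name__ == "__main__":
    nodes = estimated_nodes(5.5, 12)  # slide figures: 5.5 PB total, 12 TB per node
    print(f"estimated nodes: {nodes:.0f}")
    print(f"cores per node: {4800 / nodes:.1f}")  # slide figure: 4800 cores
    # 40 nodes per rack is a made-up value, used only to show the calculation.
    print(f"oversubscription at 40 nodes/rack: {oversubscription(40):.1f}:1")
```

Under these assumptions the storage figures imply roughly 458 nodes. The non-integer cores-per-node result suggests that the published numbers are rounded or that the hardware is not uniform.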
  • 17. Hive & Hadoop Usage @ Facebook. Statistics per day: 4 TB of compressed new data added; 135 TB of compressed data scanned; 7500+ Hive jobs; 80K compute hours. Hive simplifies Hadoop: new engineers go through a Hive training session; ~200 people/month run jobs on Hadoop/Hive; analysts (non-engineers) use Hadoop through Hive; 95% of jobs are Hive jobs.
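The daily statistics on slide 17 can be turned into per-job averages. This is a rough sketch and not part of the talk. It assumes decimal units, treats "compute hours" as core-hours, and reuses the 4800-core figure from slide 16. Since "7500+" is a lower bound and some compute hours may belong to non-Hive jobs, the results are only order-of-magnitude estimates.

```python
def daily_averages(scanned_tb, jobs, compute_hours, total_cores):
    """Derive rough per-job and utilization averages from daily totals."""
    busy_cores = compute_hours / 24  # average number of cores busy around the clock
    return {
        "gb_scanned_per_job": scanned_tb * 1000 / jobs,  # assumption: 1 TB = 1000 GB
        "compute_hours_per_job": compute_hours / jobs,
        "avg_busy_cores": busy_cores,
        "avg_core_utilization": busy_cores / total_cores,
    }


if __name__ == "__main__":
    # Slide figures: 135 TB scanned, 7500+ jobs, 80K compute hours, 4800 cores.
    for name, value in daily_averages(135, 7500, 80_000, 4800).items():
        print(f"{name}: {value:.2f}")
```

With the slide's numbers this works out to about 18 GB scanned and a little under 11 compute hours per job, with roughly 70% of the cores busy on average.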
  • 18. Hive & Hadoop Usage @ Facebook. Types of applications: reporting (e.g., daily/weekly aggregations of impression/click counts; measures of user engagement; MicroStrategy dashboards); ad hoc analysis (e.g., how many group admins, broken down by state/country); machine learning (assembling training data): ad optimization, e.g., user engagement as a function of user attributes; many others.
  • 19. Facebook’s contributions… A lot of significant contributions: Hive; HDFS features; scheduler work; etc. See the talks by Dhruba Borthakur and Zheng Shao in the development track for more information on these projects.
  • 20.  

Editor's Notes

  • #3: Cost of training people is high – have to reduce cost by making system easy to use.
  • #4: Why Hive? Petabytes of structured data; user base familiar with SQL and Python/Perl/PHP. Commercial warehousing software does not scale, is very expensive and inflexible, and is closed source and not programmable using Python/Perl/PHP. Solution: a SQL layer on top of scalable storage and map-reduce (Hadoop). Openness: use any data format, embed any programming language.
  • #16: Nomenclature: Core switch and Top of Rack
  • #18: 1GB connectivity within a rack, 100MB across racks? Are all disks 7200 SATA?