MapReduce over snapshots
HBASE-8369

Enis Soztutar
Enis [at] apache [dot] org
@enissoz

© Hortonworks Inc. 2011

About Me
• In the Hadoop space since 2007
• Committer and PMC Member in Apache HBase and Hadoop
• Working at Hortonworks as a member of Technical Staff
• Twitter: @enissoz

Architecting the Future of Big Data

Snapshots
• Currently a snapshot is a set of reference files together with some metadata
• A table's snapshot can contain:
– Table descriptor
– List of regions
– References to files in the regions
– References to WALs for regionservers

• The current snapshot implementation is flush-based
– Forces a flush on all regions, so that in-memory data is written to disk
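The flush-then-reference behaviour can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyRegion`, `put`, `flush`, and `snapshot` are hypothetical names invented for illustration and are not HBase classes or methods. It shows that a snapshot taken after a forced flush only records references to files that already exist, so later writes do not change it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a flush-based snapshot; not the HBase implementation. */
final class FlushSnapshotSketch {

    /** Hypothetical region: an in-memory buffer plus immutable flushed files. */
    static final class ToyRegion {
        private final List<String> memstore = new ArrayList<>();
        private final List<List<String>> storeFiles = new ArrayList<>();

        void put(String row) {
            memstore.add(row);
        }

        /** Writes buffered rows out as one new immutable file. */
        void flush() {
            if (!memstore.isEmpty()) {
                storeFiles.add(List.copyOf(memstore));
                memstore.clear();
            }
        }

        List<List<String>> storeFiles() {
            return storeFiles;
        }
    }

    /** Flushes every region, then records references to the files present at that moment. */
    static List<List<String>> snapshot(List<ToyRegion> regions) {
        List<List<String>> references = new ArrayList<>();
        for (ToyRegion region : regions) {
            region.flush();
            references.addAll(region.storeFiles());
        }
        return List.copyOf(references);
    }
}
```

In this model, rows written after `snapshot` returns land in new files that the snapshot never references, which is why the snapshot can be treated as immutable.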

MR over Snapshots
• The idea is to run scans on the client side, bypassing region servers
• Use snapshots since they are immutable
• Similar to short-circuit HDFS reads
• TableSnapshotInputFormat works similarly to TableInputFormat

• TableMapReduceUtil methods to configure the job

Deployment Options
HBase online
• Take snapshot while HBase is running
• Run MR job over the snapshot

HBase offline
• Take snapshot while HBase is running
• Export the snapshot to a different HDFS cluster using ExportSnapshot
• Run MR job over snapshot with or without HBase running

TableSnapshotInputFormat
• Gets a Scan representing the query
• Restores the snapshot to a temporary directory
• For each region in the snapshot:
– Determine whether the region should be scanned (falls between scan start row and stop row)
– Create one split per region in the scan range (= # of map tasks)
– Each RecordReader will open the region (HRegion) as in HRegionServer
– An internal RegionScanner is used for running the scan
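The per-region selection step above can be sketched in plain Java. This is a minimal illustration, not the TableSnapshotInputFormat code: the `Region` class, `overlaps`, `splitsFor`, and `key` are hypothetical names, and the sketch assumes row keys compare as unsigned bytes and that an empty key means "unbounded" on that side.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of choosing one split per region in the scan range. */
final class SnapshotSplitSketch {

    /** Hypothetical stand-in for a region covering [startKey, endKey). */
    static final class Region {
        final String name;
        final byte[] startKey;
        final byte[] endKey;

        Region(String name, byte[] startKey, byte[] endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Convenience helper for building keys in examples. */
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Assumption: an empty key means "no bound" on that side. */
    private static boolean unbounded(byte[] k) {
        return k.length == 0;
    }

    /** True if the region's key range intersects the scan range [startRow, stopRow). */
    static boolean overlaps(Region r, byte[] startRow, byte[] stopRow) {
        boolean regionEndsAfterScanStart =
            unbounded(r.endKey) || unbounded(startRow)
                || Arrays.compareUnsigned(r.endKey, startRow) > 0;
        boolean regionStartsBeforeScanStop =
            unbounded(r.startKey) || unbounded(stopRow)
                || Arrays.compareUnsigned(r.startKey, stopRow) < 0;
        return regionEndsAfterScanStart && regionStartsBeforeScanStop;
    }

    /** Returns one split (identified here by region name) per overlapping region. */
    static List<String> splitsFor(List<Region> regions, byte[] startRow, byte[] stopRow) {
        List<String> splits = new ArrayList<>();
        for (Region r : regions) {
            if (overlaps(r, startRow, stopRow)) {
                splits.add(r.name);
            }
        }
        return splits;
    }
}
```

With three regions split at "g" and "p", a scan from "h" to "k" would produce a single split for the middle region, so the job would run one map task.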

API

Timeline
• Will (hopefully) be committed to trunk next week or so
• Interest in bringing this to the 0.94 and 0.96 code bases as well
• Will come in HDP-2.1, which will be based on the 0.96 line

Security Aspects
• The HBase user owns the files in the filesystem
• Snapshot files are also owned by the HBase user
• The MapReduce job should be able to read the files in the snapshot + actual data files
• HDFS only has POSIX-like permissions based on user/group/other
– The user running the MR job has to be either the HBase user, or have group permissions
– HDFS does not have ACLs, so there is no easy way to grant read access at the filesystem layer

• Idea: similar to the current short-circuit implementation, we can implement an FD transfer
– User will submit jobs under her own user credentials
– Ask HBase daemons to open the files, and pass a handle / token
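The user/group/other constraint described above can be made concrete with a small sketch. It is an illustration of POSIX-style class selection only, not HDFS code: `canRead` and the user, group, and mode values in the usage note are hypothetical. The permission parsing uses the standard `java.nio.file.attribute.PosixFilePermissions` helper.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Toy POSIX-style read check: owner class first, then group, then other. */
final class PosixReadCheckSketch {

    /**
     * Decides read access from a 9-character mode string such as "rwxr-x---".
     * Exactly one permission class applies to a given user.
     */
    static boolean canRead(String user, Set<String> userGroups,
                           String fileOwner, String fileGroup, String mode) {
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString(mode);
        if (user.equals(fileOwner)) {
            return perms.contains(PosixFilePermission.OWNER_READ);
        }
        if (userGroups.contains(fileGroup)) {
            return perms.contains(PosixFilePermission.GROUP_READ);
        }
        return perms.contains(PosixFilePermission.OTHERS_READ);
    }
}
```

For files owned by `hbase:hbase` with mode `rwxr-x---`, a job submitted by a user outside the `hbase` group falls into the "other" class and cannot read the snapshot files, which is the gap the FD-transfer idea is meant to close.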

Performance
ScanTest:
• Scan: open a scanner, do full table scan
• SnapshotScan: open a client-side scanner, do full table scan
• ScanMR: parallel full table scan from MR
• SnapshotScanMR: do full table scan

• 8 Region servers, 6 disks each
• HBase trunk
• Hadoop-2.2 (HDP-2.0.7.0-12)
• Load data with IntegrationTestBulkLoad
– Evenly distributed rows, created as bulk loaded hfiles. 3 column families

• # store files per region varies 3, 6, 9, and 12 (1, 2, 3, 4 file per store)
• Data sizes: 6.6G, 13.2G, 19.8G, 26.4G
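The relationship between the figures above can be checked with a short calculation. This is a sketch, and the 6.6G step per additional file per store is inferred from the listed sizes rather than stated in the slides; the class and method names are hypothetical.

```java
/** Reproduces the store-file counts and data sizes listed for the scan test. */
final class ScanTestSizesSketch {

    static final int COLUMN_FAMILIES = 3;

    // Assumption inferred from the listed sizes: each extra file per store adds 6.6G.
    // Kept in tenths of a gigabyte to avoid floating-point rounding.
    static final int TENTHS_OF_GB_PER_STEP = 66;

    /** One store per column family, so files per region = families x files per store. */
    static int storeFilesPerRegion(int filesPerStore) {
        return COLUMN_FAMILIES * filesPerStore;
    }

    /** Total data size for the given number of files per store, formatted like "13.2G". */
    static String dataSize(int filesPerStore) {
        int tenths = TENTHS_OF_GB_PER_STEP * filesPerStore;
        return (tenths / 10) + "." + (tenths % 10) + "G";
    }
}
```

For 1 to 4 files per store this yields 3, 6, 9, and 12 store files per region and the four data sizes listed.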

Scan speed

API
• We do not want to limit snapshot scanning to MapReduce only
• Allow client-side scanners over snapshot files

ResultScanner is main scan API

API (caution: not final yet)

To the future and beyond
• HBASE-8691 High-Throughput Streaming Scan API
• Can we bypass regionservers without taking snapshots?
• Bypass memstore data, or stream memstore data, but read directly from hfiles
• Secure reading from snapshots
• Keep up with the updates at
– https://issues.apache.org/jira/browse/HBASE-8369

Thanks
Questions?
Enis Söztutar
enis [ at ] apache [dot] org
@enissoz

