MapReduce over snapshots

The document proposes using MapReduce jobs to perform scans over HBase snapshots. Snapshots provide immutable data from HBase tables. The MapReduce jobs would bypass region servers and scan snapshot files directly for improved performance. An initial implementation called TableSnapshotInputFormat is described which restores snapshot data and runs scans in parallel across map tasks. The implementation addresses security and performance aspects. An API for client-side scanning of snapshots is also proposed to allow snapshot scans outside of MapReduce.

MapReduce over snapshots
HBASE-8369

Enis Soztutar
Enis [at] apache [dot] org
@enissoz

© Hortonworks Inc. 2011

About Me
• In the Hadoop space since 2007
• Committer and PMC Member in Apache HBase and Hadoop
• Working at Hortonworks as a Member of Technical Staff
• Twitter: @enissoz

Architecting the Future of Big Data

Snapshots
• Currently, a snapshot is a set of reference files together with some metadata
• A table’s snapshot can contain
– Table descriptor
– List of regions
– References to files in the regions
– References to WALs for region servers

• The current snapshot implementation is flush-based
– Forces a flush on all regions, so that in-memory data is written to disk

MR over Snapshots
• The idea is to do scans on the client side, bypassing region servers
• Use snapshots, since they are immutable
• Similar to short-circuit HDFS reads
• TableSnapshotInputFormat works similarly to TableInputFormat

• TableMapReduceUtil methods configure the job

Deployment Options
HBase online
• Take a snapshot while HBase is running
• Run the MR job over the snapshot

HBase offline
• Take a snapshot while HBase is running
• Export the snapshot to a different HDFS cluster using ExportSnapshot
• Run the MR job over the snapshot, with or without HBase running

TableSnapshotInputFormat
• Gets a Scan representing the query
• Restores the snapshot to a temporary directory
• For each region in the snapshot:
– Determine whether the region should be scanned (falls between the scan start row and stop row)
– Create one split per region in the scan range (this sets the # of map tasks)
– Each RecordReader opens the region (HRegion), as HRegionServer does
– An internal RegionScanner is used for running the scan
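
The split-planning steps above can be sketched in plain Java. This is a minimal model under stated assumptions, not the HBASE-8369 code: SnapshotSplitSketch, Region, and planSplits are hypothetical names, row keys are compared as unsigned bytes, the scan start row is treated as inclusive and the stop row as exclusive, and an empty key means "unbounded".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch only: models how regions are matched against a scan range.
// All names here are hypothetical; this is not the HBase implementation.
final class SnapshotSplitSketch {

    // Stand-in for a region's key range: [startKey, endKey), empty = unbounded.
    static final class Region {
        final String name;
        final byte[] startKey;
        final byte[] endKey;

        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey.getBytes(StandardCharsets.UTF_8);
            this.endKey = endKey.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Lexicographic comparison of row keys as unsigned bytes.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // True if the region's key range intersects [scanStart, scanStop).
    static boolean overlaps(Region r, byte[] scanStart, byte[] scanStop) {
        boolean startsBeforeScanStop =
            scanStop.length == 0 || r.startKey.length == 0 || compare(r.startKey, scanStop) < 0;
        boolean endsAfterScanStart =
            r.endKey.length == 0 || scanStart.length == 0 || compare(scanStart, r.endKey) < 0;
        return startsBeforeScanStop && endsAfterScanStart;
    }

    // One split (and therefore one map task) per region inside the scan range.
    static List<String> planSplits(List<Region> regions, String scanStart, String scanStop) {
        byte[] start = scanStart.getBytes(StandardCharsets.UTF_8);
        byte[] stop = scanStop.getBytes(StandardCharsets.UTF_8);
        List<String> splits = new ArrayList<>();
        for (Region r : regions) {
            if (overlaps(r, start, stop)) {
                splits.add(r.name);
            }
        }
        return splits;
    }
}
```

With three regions covering ["", "g"), ["g", "p") and ["p", ""), a scan over ["h", "k") produces a single split for the middle region, while an unbounded scan produces one split per region.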

API

Timeline
• Will (hopefully) be committed to trunk within the next week or so
• Interest in bringing this to the 0.94 and 0.96 branches as well
• Will come in HDP-2.1, which will be based on the 0.96 line

Security Aspects
• The HBase user owns the files in the filesystem
• Snapshot files are also owned by the HBase user
• The MapReduce job must be able to read the files in the snapshot plus the actual data files
• HDFS only has POSIX-like permissions based on user/group/other
– The user running the MR job has to be either the HBase user or have group permissions
– HDFS does not have ACLs, so there is no easy way to grant read access at the filesystem layer

• Idea: similar to the current short-circuit read implementation, we can implement a file descriptor (FD) transfer
– The user will submit jobs under her own user credentials
– Ask the HBase daemons to open the files, and pass back a handle / token
Performance
ScanTest:
• Scan: open a scanner, do full table scan
• SnapshotScan: open a client-side scanner, do full table scan
• ScanMR: parallel full table scan from MR
• SnapshotScanMR: do full table scan

• 8 Region servers, 6 disks each
• HBase trunk
• Hadoop-2.2 (HDP-2.0.7.0-12)
• Load data with IntegrationTestBulkLoad
– Evenly distributed rows, created as bulk loaded hfiles. 3 column families

• # store files per region varies 3, 6, 9, and 12 (1, 2, 3, 4 files per store)
• Data sizes: 6.6G, 13.2G, 19.8G, 26.4G

Scan speed

API
• We do not want to limit snapshot scanning to MapReduce only
• Allow client-side scanners over snapshot files
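
One way to picture a client-side scanner over snapshot files is an iterator that walks the snapshot's regions in key order and reads one region at a time. The sketch below is a plain-Java model under that assumption, not the proposed HBase API: SnapshotScannerSketch is a hypothetical name, and rows are plain strings held in memory rather than results read from files.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch only: a client-side scanner modelled as an iterator over regions.
// Names are hypothetical; rows are plain strings rather than HBase results.
final class SnapshotScannerSketch implements Iterator<String>, AutoCloseable {
    private final Iterator<List<String>> regions; // rows of each region, in key order
    private Iterator<String> currentRegion;       // rows of the region being read
    private boolean closed;

    SnapshotScannerSketch(List<List<String>> regionsInKeyOrder) {
        this.regions = regionsInKeyOrder.iterator();
    }

    @Override
    public boolean hasNext() {
        if (closed) {
            return false;
        }
        // Move on to the next region once the current one is exhausted.
        while ((currentRegion == null || !currentRegion.hasNext()) && regions.hasNext()) {
            currentRegion = regions.next().iterator();
        }
        return currentRegion != null && currentRegion.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return currentRegion.next();
    }

    @Override
    public void close() {
        closed = true; // a real scanner would release region and file handles here
    }
}
```

Because each region is opened only when the previous one is exhausted, a single client reads the snapshot sequentially, whereas the MapReduce path reads regions in parallel across map tasks.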

ResultScanner is the main scan API

API (caution: not final yet)

To the future and beyond
• HBASE-8691 High-Throughput Streaming Scan API
• Can we bypass regionservers without taking snapshots?
• Bypass memstore data, or stream memstore data, but read directly from hfiles
• Secure reading from snapshots
• Keep up with the updates at
– https://issues.apache.org/jira/browse/HBASE-8369

Thanks
Questions?
Enis Söztutar
enis [ at ] apache [dot] org
@enissoz

