Jesse Yates
 Salesforce.com




                  Secondary Indexing

                   the discussion so far….




9/11/12                                      HBase Pow-wow
What is it?
Problem
• HBase rows are multi-dimensional
  – Only sorted on the row key


• How do you efficiently look up deeper into the
  row key?
Example
 Row        Family       Qualifier   Timestamp   Value
 1          Name         First       0           Babe
 1          Name         Last        0           Ruth




How do we find all people with the last name ‘Ruth’?


                     Full table scan!
Indexing!
Row       Family    Qualifier   Timestamp   Value
Ruth      Name      Last        0           1




  Store the property we need to search
  for as the row key of the index table
  • value is a pointer back to the primary row (here, row 1)
  • fast lookup - O(lg(n))
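
To make the contrast between the full table scan and the index lookup concrete, here is a minimal in-memory sketch. It is a toy model, not HBase client code: the sorted Python lists stand in for HBase's sorted row keys, and the names `primary`, `index`, `full_scan` and `index_lookup` are invented for illustration. One assumption differs from the slide: the primary row key is appended to the index key (rather than stored only in the value) so that two people with the same last name do not collide.

```python
import bisect

# Toy model of an HBase table: cells kept sorted by (row, family, qualifier).
# None of these names are HBase APIs.
primary = sorted([
    ("1", "Name", "First", "Babe"),
    ("1", "Name", "Last", "Ruth"),
    ("2", "Name", "First", "Hank"),
    ("2", "Name", "Last", "Aaron"),
])


def full_scan(last_name):
    """Check every cell in the table: O(n)."""
    return [row for row, fam, qual, val in primary
            if (fam, qual, val) == ("Name", "Last", last_name)]


# Index table: the indexed value leads the key, the primary row key follows
# (assumption: appended so duplicate last names get distinct index rows).
index = sorted((val, row) for row, fam, qual, val in primary
               if (fam, qual) == ("Name", "Last"))


def index_lookup(last_name):
    """Binary-search to the first matching index row, then read forward."""
    i = bisect.bisect_left(index, (last_name, ""))
    rows = []
    while i < len(index) and index[i][0] == last_name:
        rows.append(index[i][1])
        i += 1
    return rows
```

Both functions return the same primary row keys; only the amount of data touched differs.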
Use Cases
• Point lookups
  – Volume of data influences usefulness of index
     • Let user decide if they need to use an index


• Scan lookup
  – WHERE age > 16
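
A scan lookup such as WHERE age > 16 only works against an index if the index keys sort in the same order as the values. HBase compares row keys as raw bytes, so a rough sketch of the idea, assuming a fixed-width big-endian encoding for the age (the `people` data and helper names are made up), looks like this:

```python
import bisect
import struct


def encode_age(age):
    """Big-endian unsigned 32-bit: byte order matches numeric order."""
    return struct.pack(">I", age)


# Hypothetical data: primary row key -> age.
people = {"alice": 15, "bob": 16, "carol": 17, "dave": 42}

# Index rows: encoded age followed by the primary row key, sorted as bytes.
index = sorted(encode_age(age) + name.encode() for name, age in people.items())


def older_than(age):
    """WHERE age > `age`: seek to the first key for age + 1, scan to the end."""
    start = bisect.bisect_left(index, encode_age(age + 1))
    return [key[4:].decode() for key in index[start:]]
```

Encoding the age as a decimal string would break the scan, since "9" sorts after "16" lexicographically.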
Implementations
Omid

Full transactional support
    Centralized oracle
Lily

WAL implementation on top of HBase
        100-500 writes/sec
Percolator

       Full transactions
Distributed, optimistic locking
  ~10 sec latencies possible
Culvert

         Async
Dead project, incomplete
http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html
       Client-side coordinated index
       Use timestamps to coordinate
           Not yet implemented
Trend Micro Implementation

          Still just POC
                 ???
Solr/Lucene

Standard Lucene library bolted on HBase
           Not commonly used
 Lots of formats/codecs already written
Considerations for HBase

    What do we need to do?
Built-in vs.
     external library vs.
semi-supported (e.g. security)
Which should I use??
•   HBase experts write a single ‘right’ impl
•   Officially endorse a ‘correct’ version
•   What changes do we need to make?
•   How close to the core is the project?
    – Written into the core everywhere
    – hbase-index module
    – External library
Async vs.
Synchronous vs.
 Transactional
Key Observation
“Secondary indexing is inherently an easier
  problem than full transactions… secondary
  index updates are idempotent.”

        - Lars Hofhansl
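
A small sketch of why idempotence matters: if an index entry is fully determined by its coordinates (indexed value, primary row, timestamp), a client can blindly retry the write after a timeout without a transaction. The dictionary-based model and function names below are illustrative assumptions, with a counter increment shown as a non-idempotent contrast.

```python
def put_index_entry(index, value, primary_row, ts):
    """Write one index cell; its coordinates fully determine its content."""
    index[(value, primary_row, ts)] = b""


def increment(counters, key):
    """Non-idempotent contrast: every retry changes the result."""
    counters[key] = counters.get(key, 0) + 1


index = {}
put_index_entry(index, "Ruth", "1", 0)
after_first = dict(index)
put_index_entry(index, "Ruth", "1", 0)   # retry of the same write

counters = {}
increment(counters, "Ruth")
increment(counters, "Ruth")              # retry double-counts
```

After the retry, `index` is identical to `after_first`, while the counter has drifted.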
Async vs. Synchronous vs. Transactional

• We don’t need full transactions
  – Transactions are slow
  – Transactions fail with increasing probability as
    number of servers increases
• Optionally async or sync
  – Async
     • Inherently ‘dirty’ index
• How does index cleanup work?
  – Inherently different for each type
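
One possible cleanup strategy for a 'dirty' index, sketched here as an assumption rather than anything proposed in the discussion, is to verify each index hit against the primary row at read time and lazily delete entries that no longer match. The in-memory `primary` and `index` structures and both function names are invented for illustration.

```python
primary = {"1": "Ruth"}      # primary row key -> current last name
index = {("Ruth", "1")}      # (indexed value, primary row key)


def update_last_name(row, new_name):
    """Write the primary row and the new index entry.

    The old index entry is deliberately left behind, which is what makes
    the index dirty until cleanup runs.
    """
    primary[row] = new_name
    index.add((new_name, row))


def lookup(name):
    """Return matching primary rows, dropping stale index entries on the way."""
    hits = []
    for value, row in sorted(e for e in index if e[0] == name):
        if primary.get(row) == name:
            hits.append(row)
        else:
            index.discard((value, row))   # lazy cleanup of a stale entry
    return hits
```

The trade-off is an extra read of the primary row per index hit in exchange for never returning a stale result.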
Locality
Where’s my data?
• Extra columns vs. index table
• HBase Region-pinning
  –   Has to be best-effort or will decrease availability
  –   Helps minimize RPC overhead
  –   Cross-table region-pinning
  –   Needs a coprocessor hook to be useful


• HDFS block allocation
  – Keep index and data blocks on same HDFS node
Index Cardinality
How much data are we talking?
“Seems like there are 3 categories of sparseness:
1. sparse indexes (like ipAddress) where a per-table
   approach is more efficient for reads

2. dense indexes (like eventType) where there are likely
   values of every index key on each region

3. very dense indexes (like male/female) where you
   should just be doing a table scan anyway”

                      - Matt Corgan (9/10/12)
Impact on implementation
• Need a lot of knowledge of data to pick the
  right kind of index
  – User knows their data, let them do the hard work
    of picking indexes
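
As a rough illustration of the knowledge a user would need, the three sparseness categories can be approximated from a sample of the indexed column. This is only a sketch: the function name, the rule comparing rows per key with the region count, and the `scan_threshold` cutoff are all assumptions, not values from the discussion.

```python
def classify_index(values, num_regions, scan_threshold=0.25):
    """Place an indexed column into one of the three sparseness categories.

    scan_threshold is a made-up cutoff, not a number from the discussion.
    """
    distinct = len(set(values))
    rows_per_key = len(values) / distinct
    selectivity = 1.0 / distinct       # average fraction of rows per key
    if rows_per_key < num_regions:
        return "sparse"                # a key cannot be present on every region
    if selectivity < scan_threshold:
        return "dense"                 # keys plausibly present on every region
    return "very dense"                # each key matches a large share of rows
```

For example, 200 distinct IP addresses over 4 regions come out as sparse, 10 event types over 200 rows as dense, and a two-valued column as very dense.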
Pluggability
Everyone’s got an impl already
• We need to make HBase flexible enough to
  support (most) current indexing formats with
  minimal overhead for switching
  – Lucene style Codec/CodecProvider?
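
A codec/provider split could look roughly like the sketch below: a codec decides how an index row key is laid out, and a provider resolves codecs by name so implementations can be swapped through configuration. All class and method names here are hypothetical; this is neither Lucene's API nor an existing HBase interface.

```python
class IndexCodec:
    """Hypothetical interface: map (value, primary row) to an index key and back."""
    name = "abstract"

    def encode(self, value, primary_row):
        raise NotImplementedError

    def decode(self, index_key):
        raise NotImplementedError


class NullSeparatedCodec(IndexCodec):
    """Example layout: value, a NUL byte, then the primary row key.

    Assumes the indexed value itself never contains a NUL byte.
    """
    name = "null-separated"

    def encode(self, value, primary_row):
        return value.encode() + b"\x00" + primary_row.encode()

    def decode(self, index_key):
        value, _, row = index_key.partition(b"\x00")
        return value.decode(), row.decode()


class CodecProvider:
    """Registry that resolves a codec by its configured name."""

    def __init__(self):
        self._codecs = {}

    def register(self, codec):
        self._codecs[codec.name] = codec

    def lookup(self, name):
        return self._codecs[name]


provider = CodecProvider()
provider.register(NullSeparatedCodec())
key = provider.lookup("null-separated").encode("Ruth", "1")
```

Switching index formats then means registering another codec under a different name, without touching the code that reads or writes the index.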
Client-interface
What should it look like?
• Minimal changes to the top-level interfaces
  – Add a single new flag?
  – Configuration based?
• Enough that the user gets to be smart about
  what should be used
  – We can’t get all cases right – just provide building
    blocks
• Automatically use an index?
• Scanner/Filter style use?
Properties for the client
• Should the user even see the index lookups?

• ACID?
• Ordering of results?
  – Support the current sorted order?
  – Batch lookup?

• Implications on current features
  – Replication
  – Splitting
Schema(less)
• Schema enforced?
  – Rigid usage of index matching an expected schema?
  – Schema table? Reserved schema columns? .META.?
• Schema-less
  – Let the user apply whatever they think and use only
    what actually works
• Best-effort
  – Use client-hinted schema and try to apply all the
    known indexes
My random thoughts….
• Client-side managed indexes are efficient
  – Minimal RPC overhead
     • Cleanup is async to client and rarely misses
  – Solves the cross-region/server problem
     • Region-pinning is a nice-to-have optimization
  – Scales without concern for locality
  – Flexible enough to support custom codecs
  – Can be built to provide server-side optimizations
     • Locality aware indexes to minimize RPCs
Discussion!

Musings on Secondary Indexing in HBase
