This document summarizes Dave Latham's experience operating large HBase clusters for Flurry Analytics. Some key points include:
- Flurry runs two HBase clusters of 1000 slave nodes each, storing over 400 TB of data across 30 tables and 250,000 regions.
- At this scale they hit bottlenecks in the HMaster, NameNode, and ZooKeeper, and the large number of small regions compounded these problems.
- Lessons learned include the need to understand system limitations, monitor performance continuously, design with the expected load in mind, and know your workload characteristics.
- Future work includes testing HDFS high availability (HA), snapshots at large scale, separating workloads, and evaluating larger region and HDFS block sizes.
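One way to reduce region count is to let each region grow larger before it splits, and to pair that with a larger HDFS block size for big store files. The sketch below shows the relevant settings; the specific values (20 GB regions, 256 MB blocks) are illustrative assumptions, not figures from the talk.

```xml
<!-- hbase-site.xml: raise the size at which a region splits,
     so fewer, larger regions cover the same data (assumed 20 GB). -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>21474836480</value>
</property>

<!-- hdfs-site.xml: larger HDFS block size (assumed 256 MB) reduces
     the number of blocks the NameNode must track for large files. -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>
```

Both changes trade per-region and per-block overhead on the master services for coarser units of data, which is why testing them at scale appears in the future-work list.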