https://zedware.github.io
Managing data is the core task of storage and database systems. Metadata,
however, receives less attention, especially in traditional systems. Even in
the design of parallel or distributed systems, architectures often assume that
the metadata fits in the memory of a single node. Hadoop is perhaps the most
famous example, and LinkedIn is representative of large-scale Hadoop adoption.
Evidence can be found in write-ups such as "The Endless Pursuit of Scale at
LinkedIn" [1].
“The largest Hadoop cluster at LinkedIn — which is probably the largest one in
the world, certainly at a commercial entity if not a security agency for a major
nation — has 10,000 nodes and weighed in at 500PB of storage capacity in 2020”.
“The single HDFS NameNode for this monster can serve remote procedure calls with
an average latency of under ten milliseconds, which is pretty good considering
that this HDFS cluster has over 1.6 billion objects (that metric counts
directories, files, and blocks together).” To recap, the cluster has 10,000
nodes, 500 PB of storage, and 1.6 billion objects, and the single NameNode
consumes about 380 GB of memory. How did LinkedIn get there [2]?
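As a rough sanity check on these figures, the snippet below divides the reported NameNode memory by the reported object count. It assumes 1 GB = 10^9 bytes and that all 380 GB is attributable to namespace objects, which overstates the true per-object cost because the JVM heap also holds other state. The function name is made up for this illustration.

```python
def bytes_per_object(heap_gb: float, object_count: float) -> float:
    """Average NameNode memory per namespace object (assumes 1 GB = 10**9 bytes)."""
    return heap_gb * 10**9 / object_count


# Figures quoted above: about 380 GB of memory, 1.6 billion objects.
print(f"{bytes_per_object(380, 1.6e9):.1f} bytes per object")  # prints 237.5 bytes per object
```

A few hundred bytes per directory, file, or block is small, yet at this object count it adds up to a heap that only a large single node can hold.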
“HDFS started running out of gas in 2016, and LinkedIn created Dynamometer to
measure HDFS NameSpace performance and to see how it will perform as more load
is applied to an actual system [3].” “YARN started running into
trouble in 2019 as LinkedIn started pushing scale hard, and the company created
DynoYARN to simulate load on YARN clusters at any size and with any workload.
This DynoYARN modeler has been open-sourced as of [4], too.” It is
reported that LinkedIn stores 1 exabyte of data across all its Hadoop clusters,
while the largest, 10,000-node cluster stores 500 PB. QJM is used as the
Journal Service. The optimizations include Java tuning, such as the ratio of
the Young to the Tenured generation; running the global read-write lock in
non-fair mode; and using Dynamometer to test a workload before deployment.
Files smaller than 512 MB are treated as small. The logging directories
contained many small files (less than 1 MB on average), so they were moved to a
dedicated cluster with the help of a homegrown file system driver, FailoverFS.
The Observer node is a standby NameNode that serves consistent reads [5].
Fast-path tailing of transactions reduces the gap between the Observer and the
Active NameNode. LastSeenStateId and LastWrittenStateId help a client read its
own writes. A new HDFS API, FileSystem.msync(), addresses third-party
communication, where one client learns of a change from another client rather
than from the NameNode. msync() is similar to HDFS hsync(): the latter
guarantees that the data is available for other clients to read, while msync()
provides the same guarantee for metadata. The NameNodes now perform 150K
ops/sec on average, peaking at 250K ops/sec, with an average latency of
1-2 msec.
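The interplay of state IDs and msync() can be illustrated with a toy model. This is a minimal sketch, not the HDFS implementation: the class and method names are hypothetical, the edit log is an in-memory list, and a lagging Observer raises an error here, whereas a real Observer would hold or redirect the call. Only the names LastSeenStateId, LastWrittenStateId, and msync() come from the text above.

```python
class ObserverNotCaughtUp(Exception):
    """The Observer has not yet applied the edits this client has seen."""


class ActiveNameNode:
    def __init__(self):
        self.namespace = {}
        self.edit_log = []               # (state_id, path, value), ids 1, 2, 3, ...
        self.last_written_state_id = 0   # models LastWrittenStateId

    def write(self, path, value):
        self.last_written_state_id += 1
        self.namespace[path] = value
        self.edit_log.append((self.last_written_state_id, path, value))
        return self.last_written_state_id


class ObserverNameNode:
    def __init__(self):
        self.namespace = {}
        self.applied_state_id = 0

    def tail_edits(self, active):
        # State ids are consecutive from 1, so the slice skips applied edits.
        for state_id, path, value in active.edit_log[self.applied_state_id:]:
            self.namespace[path] = value
            self.applied_state_id = state_id

    def read(self, path, last_seen_state_id):
        if self.applied_state_id < last_seen_state_id:
            raise ObserverNotCaughtUp(path)
        return self.namespace.get(path)


class Client:
    def __init__(self, active, observer):
        self.active, self.observer = active, observer
        self.last_seen_state_id = 0      # models LastSeenStateId

    def write(self, path, value):
        # Remember the state id of our own write: read-your-own-writes.
        self.last_seen_state_id = self.active.write(path, value)

    def msync(self):
        # Learn the Active's latest state id without writing anything.
        self.last_seen_state_id = self.active.last_written_state_id

    def read(self, path):
        return self.observer.read(path, self.last_seen_state_id)
```

In this model, a writer that has just called write() cannot get a stale answer from a lagging Observer, because its LastSeenStateId is ahead of what the Observer has applied. A second client that heard about the write out of band still carries an old state ID and would read stale metadata until it calls msync().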
The NameNode keeps its metadata in a combination of in-memory Java data
structures, not the B+-tree-like structures used in database systems. A
checkpoint is a full dump of that memory, but it can be offloaded to a standby
server called a Checkpoint Node [6]. Offloading this kind of work to a
redundant node is common in primary/standby or multiple-replica deployments.
If there are no standby or replica nodes, the checkpoint can be offloaded to
the storage system or to some purpose-built facility. A full checkpoint takes
longer as the footprint of the metadata grows. Incremental checkpoints may
help, but the algorithm can be complex, and merging the checkpoints can also
be slow.
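The trade-off between full and incremental checkpoints can be sketched with a small in-memory store. This is an illustrative model under stated assumptions, not how the NameNode writes its image: the MetadataStore class, the restore() helper, and the use of None as a deletion marker are all hypothetical, and metadata values are assumed never to be None.

```python
import copy


class MetadataStore:
    """In-memory path -> metadata map that tracks changes since the last checkpoint."""

    def __init__(self):
        self.entries = {}
        self._dirty = {}   # path -> new metadata, or None for a deletion

    def put(self, path, meta):
        self.entries[path] = meta
        self._dirty[path] = meta

    def delete(self, path):
        self.entries.pop(path, None)
        self._dirty[path] = None

    def full_checkpoint(self):
        # Cost grows with the whole namespace, however little has changed.
        self._dirty = {}
        return copy.deepcopy(self.entries)

    def incremental_checkpoint(self):
        # Cost grows only with the entries touched since the last checkpoint.
        delta, self._dirty = self._dirty, {}
        return copy.deepcopy(delta)


def restore(full, deltas):
    """Rebuild the namespace by replaying deltas, oldest first, over a full image."""
    entries = copy.deepcopy(full)
    for delta in deltas:
        for path, meta in delta.items():
            if meta is None:
                entries.pop(path, None)
            else:
                entries[path] = meta
    return entries
```

The sketch shows where the complexity moves: incremental checkpoints are cheap to take, but recovery has to replay every delta in order, so a long chain of deltas makes the merge slower until a fresh full checkpoint resets it.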
Footnotes
1 https://www.nextplatform.com/2021/09/15/the-endless-pursuit-of-scale-at-linkedin/
2 https://engineering.linkedin.com/blog/2021/the-exabyte-club--linkedin-s-journey-of-scaling-the-hadoop-distr
3 https://engineering.linkedin.com/blog/2018/02/dynamometer--scale-testing-hdfs-on-minimal-hardware-with-maximum
4 https://engineering.linkedin.com/blog/2021/scaling-linkedin-s-hadoop-yarn-cluster-beyond-10-000-nodes
5 https://issues.apache.org/jira/browse/HDFS-12943
6 https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Checkpoint_Node