Hadoop NameNode and Checkpoint

Article link: https://blog.csdn.net/zedware/article/details/125237732

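This post describes how LinkedIn tackled the metadata management challenges of its Hadoop clusters. As data volume grew, LinkedIn's Hadoop system ran into metadata performance bottlenecks. To address them, the company developed Dynamometer to test HDFS performance and created DynoYARN to simulate YARN load. Through a series of optimizations, including Java tuning, Observer nodes, and fast-path tailing of transactions, NameNode performance improved significantly, reaching 150K ops/sec on average. The post also discusses full and incremental checkpoints and their complexity.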

https://zedware.github.io

Managing data is the core task of storage and database systems. However, less
attention is paid to metadata, especially in traditional systems. Even in the
design of parallel or distributed systems, architectures often assume that
metadata can fit in the memory of a single node. Hadoop is perhaps the most
famous example, and LinkedIn is a representative adopter of Hadoop at large
scale. Evidence can be found in its blogs, such as THE ENDLESS PURSUIT OF
SCALE AT LINKEDIN¹.

“The largest Hadoop cluster at LinkedIn — which is probably the largest one in
the world, certainly at a commercial entity if not a security agency for a major
nation — has 10,000 nodes and weighed in at 500PB of storage capacity in 2020”.
“The single HDFS NameNode for this monster can serve remote procedure calls with
an average latency of under ten milliseconds, which is pretty good considering
that this HDFS cluster has over 1.6 billion objects (that metric counts
directories, files, and blocks together).” In short, the cluster has 10,000
nodes, 500 PB of storage, and 1.6 billion objects, and the single NameNode
consumes about 380 GB of memory. How did LinkedIn get there²?

“HDFS started running out of gas in 2016, and LinkedIn created Dynamometer to
measure HDFS NameSpace performance and to see how it will perform as more load
is applied to an actual system³.” “YARN started running into
trouble in 2019 as LinkedIn started pushing scale hard, and the company created
DynoYARN to simulate load on YARN clusters at any size and with any workload.
This DynoYARN modeler has been open-sourced as of ⁴, too.” It is
reported that LinkedIn stores 1 exabyte of data across all its Hadoop clusters,
while the largest 10,000-node cluster stores 500 PB. QJM (Quorum Journal
Manager) is used as the journal service.

The optimizations include Java tuning, such as the ratio of the Young to the
Tenured generation, running the global read-write lock in non-fair mode, and
using Dynamometer to test the workload before deployment. A file smaller than
512 MB is treated as small. The logging directory contained many small files
(less than 1 MB on average), so it was moved to a dedicated cluster with the
help of a homegrown file system driver, FailoverFS. The Observer node is an
implementation of a standby node that serves reads⁵. Fast-path tailing of
transactions reduces the lag between the Observer and the Active NameNode.
LastSeenStateId and LastWrittenStateId help clients read their own writes.
A new HDFS API, FileSystem.msync(), handles the case where clients communicate
through a third party. msync() is similar to HDFS hsync(): the latter
guarantees that the data is available for other clients to read, while msync()
provides the same guarantee for metadata. Now, the NameNodes perform 150K
ops/sec on average, peaking at 250K ops/sec, with an average latency of
1-2 ms.
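
To make the state-ID idea concrete, here is a minimal toy model of read-your-own-writes against an Observer. It is a sketch under stated assumptions, not Hadoop code: the classes `ActiveNameNode`, `ObserverNameNode`, `Client`, and `StaleObserverError` are hypothetical names invented for illustration, and only the terms LastSeenStateId, LastWrittenStateId, and msync() come from the text above. The model rejects a read when the Observer is behind; a real system may instead wait or retry.

```python
class StaleObserverError(Exception):
    """Raised when the observer has not yet applied what the client has seen."""


class ActiveNameNode:
    """Toy active node: applies writes and assigns increasing transaction IDs."""

    def __init__(self):
        self.namespace = {}
        self.edit_log = []               # entries are (txid, path, value)
        self.last_written_state_id = 0   # ID of the latest applied transaction

    def write(self, path, value):
        self.last_written_state_id += 1
        self.namespace[path] = value
        self.edit_log.append((self.last_written_state_id, path, value))
        return self.last_written_state_id


class ObserverNameNode:
    """Toy observer: serves reads from a copy built by tailing the edit log."""

    def __init__(self, active):
        self.active = active
        self.namespace = {}
        self.applied_state_id = 0

    def tail_edits(self):
        # Transaction IDs start at 1, so the log entry at index i has txid i + 1.
        for txid, path, value in self.active.edit_log[self.applied_state_id:]:
            self.namespace[path] = value
            self.applied_state_id = txid

    def read(self, path, client_state_id):
        # Simplification: reject the read instead of waiting for the log to catch up.
        if self.applied_state_id < client_state_id:
            raise StaleObserverError(
                f"observer at {self.applied_state_id}, client needs {client_state_id}"
            )
        return self.namespace.get(path)


class Client:
    """Toy client: writes go to the active node, reads go to the observer."""

    def __init__(self, active, observer):
        self.active = active
        self.observer = observer
        self.last_seen_state_id = 0

    def write(self, path, value):
        # Remember the state produced by our own write.
        self.last_seen_state_id = self.active.write(path, value)

    def msync(self):
        # Learn the latest state from the active node, e.g. after another
        # client reported out of band that it has written something.
        self.last_seen_state_id = max(
            self.last_seen_state_id, self.active.last_written_state_id
        )

    def read(self, path):
        return self.observer.read(path, self.last_seen_state_id)
```

In this model a writer that reads immediately after its own write gets `StaleObserverError` until `tail_edits()` has run, which is why shortening the tailing lag matters. A second client that never wrote anything can still see stale data unless it calls `msync()` first, which is the third-party case described above.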

The NameNode keeps its metadata in a combination of in-memory Java structures,
not the B+-tree-like data structures used in database systems. A checkpoint is
a full dump of that memory, but it can be performed on a standby server called
the CheckpointNode⁶. Offloading this kind of work to a redundant node is
common in primary/standby or multiple-replica deployments. If there are no
standby or replica nodes, the checkpoint can be offloaded to the storage
system or to some purpose-built facility. A full checkpoint takes longer as
the footprint of the metadata grows. Incremental checkpoints may help, but the
algorithm can be complex, and merging checkpoints can also be slow.
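
The trade-off between full and incremental checkpoints can be sketched with a small in-memory model. This is an illustration under assumptions, not how the NameNode does it: `MetadataStore` and `restore` are hypothetical names, the namespace is a plain dictionary, and `None` is used as a deletion marker, so stored values must not be `None`.

```python
class MetadataStore:
    """Toy metadata store that tracks which paths changed since the last checkpoint."""

    def __init__(self):
        self.namespace = {}
        self.txid = 0
        self.dirty = set()   # paths modified since the last checkpoint

    def apply(self, path, value):
        """Apply one transaction; value=None deletes the path."""
        self.txid += 1
        if value is None:
            self.namespace.pop(path, None)
        else:
            self.namespace[path] = value
        self.dirty.add(path)

    def full_checkpoint(self):
        # Cost grows with the total size of the namespace.
        self.dirty.clear()
        return {"txid": self.txid, "entries": dict(self.namespace)}

    def incremental_checkpoint(self):
        # Cost grows only with the number of changed paths.
        # A deleted path is recorded as None (a tombstone).
        delta = {path: self.namespace.get(path) for path in self.dirty}
        self.dirty.clear()
        return {"txid": self.txid, "entries": delta}


def restore(full, increments):
    """Rebuild the namespace from one full checkpoint plus later increments."""
    namespace = dict(full["entries"])
    txid = full["txid"]
    # Increments must be merged in transaction order; this merge is the
    # extra work that makes recovery slower as increments accumulate.
    for inc in sorted(increments, key=lambda c: c["txid"]):
        for path, value in inc["entries"].items():
            if value is None:
                namespace.pop(path, None)
            else:
                namespace[path] = value
        txid = inc["txid"]
    return namespace, txid
```

Each incremental checkpoint stays small, but `restore` has to replay every increment in order, so a long chain of increments shifts cost from checkpoint time to recovery time. Periodically taking a new full checkpoint, or merging increments in the background on a standby node, bounds that chain.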