Let's first look at what a shuffleBlockId looks like, for example: shuffle_0_3_3. How is this shuffleBlockId composed?
shuffleBlockId = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId, so here the shuffleId is 0, the mapId is 3 and the reduceId is 3.
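As a minimal sketch, the naming rule can be mirrored by a small case class (modeled on Spark's own ShuffleBlockId, simplified here for illustration):

```scala
// Simplified stand-in for Spark's ShuffleBlockId: the block name is just the
// three ids joined with underscores after the "shuffle" prefix.
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) {
  def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
}

object BlockIdDemo {
  def main(args: Array[String]): Unit = {
    println(ShuffleBlockId(0, 3, 3).name) // shuffle_0_3_3
  }
}
```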
Now let's look at ShuffleFetcher. This class is used by shuffled RDDs. The abstract class currently has only one implementation, and its most important method is fetch, which returns an iterator. As you might expect, this iterator covers every block to be fetched together with the node it lives on. Let's now walk through the fetch function.
Inputs of fetch: shuffleId, reduceId, context, serializer
Output of fetch: an iterator over the fetched blocks
override def fetch[T](
shuffleId: Int,
reduceId: Int,
context: TaskContext,
serializer: Serializer)
: Iterator[T] =
{
logDebug("Fetching outputs for shuffle %d, reduce %d".format(shuffleId, reduceId))
val blockManager = SparkEnv.get.blockManager
/* On this node, blockManager is used to call getMultiple, which pulls the blocks belonging to this reduce task from the corresponding hosts */
val startTime = System.currentTimeMillis
val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)
/* statuses is an array of (BlockManagerId, size) pairs: the location of each map output and the size of its shuffle data for this reduce partition */
logDebug("Fetching map output location for shuffle %d, reduce %d took %d ms".format(
shuffleId, reduceId, System.currentTimeMillis - startTime))
/* Note the SparkEnv.get above: in C++, the equivalent singleton accessor would look like sparkEnv.getInstance() */
val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]
/* The key is the BlockManagerId of the node holding the map outputs; the value collects the (map index, size) pairs located on that node */
for (((address, size), index) <- statuses.zipWithIndex) {
splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size))
}
val blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])] = splitsByAddress.toSeq.map {
case (address, splits) =>
(address, splits.map(s => (ShuffleBlockId(shuffleId, s._1, reduceId), s._2)))
/* Note that this must be a ShuffleBlockId, not a plain BlockId.
 * splits is a buffer of pairs:
 * s._1 is the index of the block to fetch from that node, i.e. the mapId
 * s._2 is the size of that block
 * */
}
def unpackBlock(blockPair: (BlockId, Option[Iterator[Any]])) : Iterator[T] = {
val blockId = blockPair._1
val blockOption = blockPair._2
blockOption match {
case Some(block) => {
block.asInstanceOf[Iterator[T]]
}
case None => {
blockId match {
case ShuffleBlockId(shufId, mapId, _) =>
val address = statuses(mapId.toInt)._1
throw new FetchFailedException(address, shufId.toInt, mapId.toInt, reduceId, null)
case _ =>
throw new SparkException(
"Failed to get block " + blockId + ", which is not a shuffle block")
}
}
}
}
val blockFetcherItr = blockManager.getMultiple(blocksByAddress, serializer)
/* getMultiple returns a BlockFetcherIterator, which has two
 * implementations: NettyBlockFetcherIterator and BasicBlockFetcherIterator
 */
val itr = blockFetcherItr.flatMap(unpackBlock)
val completionIter = CompletionIterator[T, Iterator[T]](itr, {
val shuffleMetrics = new ShuffleReadMetrics
shuffleMetrics.shuffleFinishTime = System.currentTimeMillis
shuffleMetrics.remoteFetchTime = blockFetcherItr.remoteFetchTime
shuffleMetrics.fetchWaitTime = blockFetcherItr.fetchWaitTime
shuffleMetrics.remoteBytesRead = blockFetcherItr.remoteBytesRead
shuffleMetrics.totalBlocksFetched = blockFetcherItr.totalBlocks
shuffleMetrics.localBlocksFetched = blockFetcherItr.numLocalBlocks
shuffleMetrics.remoteBlocksFetched = blockFetcherItr.numRemoteBlocks
context.taskMetrics.shuffleReadMetrics = Some(shuffleMetrics)
})
new InterruptibleIterator[T](context, completionIter)
}
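The grouping step above (statuses → splitsByAddress → blocksByAddress) can be sketched in isolation. In this hedged sketch, a plain host:port string stands in for BlockManagerId, and the statuses array, shuffleId and reduceId values are made up for illustration:

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

object GroupByAddressSketch {
  // Mirrors the grouping in fetch: statuses(i) is the (address, size) of map
  // task i's output for this reduce partition. The array index doubles as the
  // mapId when the block name is built below.
  def groupByAddress(statuses: Array[(String, Long)], shuffleId: Int, reduceId: Int)
      : Seq[(String, Seq[(String, Long)])] = {
    val splitsByAddress = new HashMap[String, ArrayBuffer[(Int, Long)]]
    for (((address, size), index) <- statuses.zipWithIndex) {
      splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size))
    }
    splitsByAddress.toSeq.map { case (address, splits) =>
      (address, splits.toSeq.map(s => (s"shuffle_${shuffleId}_${s._1}_$reduceId", s._2)))
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up map output locations: map tasks 0 and 2 ran on hostA, task 1 on hostB.
    val statuses = Array(("hostA:7337", 100L), ("hostB:7337", 50L), ("hostA:7337", 200L))
    groupByAddress(statuses, 0, 3).sortBy(_._1).foreach(println)
  }
}
```

Grouping by address first lets getMultiple issue one fetch request per remote node instead of one per block.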
The fetch function involves BlockManager from the storage module, as well as the MapOutputTracker class.
Key variables: statuses, splitsByAddress, blocksByAddress
Key functions: getMultiple, getServerStatuses
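The unpackBlock helper in the code above boils down to flatMapping Option-wrapped iterators: present blocks contribute their elements, a missing block fails the fetch. A minimal standalone sketch (the block ids and contents here are made up, and a RuntimeException stands in for FetchFailedException):

```scala
object UnpackSketch {
  // Mirrors unpackBlock: yield the block's iterator when it was fetched,
  // fail fast when it is missing.
  def unpack[T](pair: (String, Option[Iterator[T]])): Iterator[T] = pair match {
    case (_, Some(it)) => it
    case (id, None)    => throw new RuntimeException("Failed to fetch block " + id)
  }

  def main(args: Array[String]): Unit = {
    // Made-up (blockId, contents) pairs, standing in for what getMultiple returns.
    val fetched = Iterator(
      ("shuffle_0_0_3", Some(Iterator(1, 2))),
      ("shuffle_0_1_3", Some(Iterator(3)))
    )
    // flatMap stitches the per-block iterators into one stream for the reduce task.
    println(fetched.flatMap(unpack).toList) // List(1, 2, 3)
  }
}
```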