Let's first look at what a shuffleBlockId looks like, for example: shuffle_0_3_3. How is this shuffleBlockId composed?
shuffleBlockId = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId, so here the shuffleId is 0, the mapId is 3 and the reduceId is 3.
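As a minimal sketch, the naming rule can be mirrored by a small case class (modeled on Spark's own ShuffleBlockId, simplified here for illustration):

```scala
// Simplified stand-in for Spark's ShuffleBlockId: the block name is just the
// three ids joined with underscores after the "shuffle" prefix.
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) {
  def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
}

object BlockIdDemo {
  def main(args: Array[String]): Unit = {
    println(ShuffleBlockId(0, 3, 3).name) // shuffle_0_3_3
  }
}
```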
Now let's look at ShuffleFetcher. This class is used by shuffled RDDs. The abstract class currently has only one implementation, and its most important method is fetch, which returns an iterator. As you might expect, this iterator covers every block to be fetched together with the node it lives on. Let's now walk through the fetch function.
Inputs of fetch: shuffleId, reduceId, context, serializer
Output of fetch: an iterator over the fetched blocks
override def fetch[T](
shuffleId: Int,
reduceId: Int,
context: TaskContext,
serializer: Serializer)
: Iterator[T] =
{
logDebug("Fetching outputs for shuffle %d, reduce %d".format(shuffleId, reduceId))
val blockManager = SparkEnv.get.blockManager
/* On this node, blockManager is used to call getMultiple, which pulls the blocks belonging to this reduce task from the corresponding hosts */
val startTime = System.currentTimeMillis
val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)
/* statuses is an array of (BlockManagerId, size) pairs: the location of each map output and the size of its shuffle data for this reduce partition */
logDebug("Fetching map output location for shuffle %d, reduce %d took %d ms".format(
shuffleId, reduceId, System.currentTimeMillis - startTime))
/* Note the SparkEnv.get above: in C++, the equivalent singleton accessor would look like sparkEnv.getInstance() */
val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]
/* The key is the BlockManagerId of the node holding the map outputs; the value collects the (map index, size) pairs located on that node */
for (((address, size), index) <- statuses.zipWithIndex) {
splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size))
}
val blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])] = splitsByAddress.toSeq.map {
case (address, splits) =>
(address, splits.map(s => (ShuffleBlockId(shuffleId, s._1, reduceId), s._2)))
/* Note that this must be a ShuffleBlockId, not a plain BlockId.
 * splits is a buffer of pairs:
 * s._1 is the index of the block to fetch from that node, i.e. the mapId
 * s._2 is the size of that block
 * */
}
def unpackBlock(blockPair: (BlockId, Option[Iterator[Any]])) : Iterator[T] = {
val blockId = blockPair._1
val blockOption = blockPair._2
blockOption match {
case Some(block) => {
block.asInstanceOf[Iterator[T]]
}
case None => {
blockId match {
case ShuffleBlockId(shufId, mapId, _) =>
val address = statuses(mapId.toInt)._1
throw new FetchFailedException(address, shufId.toInt, mapId.toInt, reduceId, null)
case _ =>
throw new SparkException(
"Failed to get block " + blockId + ", which is not a shuffle block")
}
}
}
}
val blockFetcherItr = blockManager.getMultiple(blocksByAddress, serializer)
/* getMultiple returns a BlockFetcherIterator, which has two
 * implementations: NettyBlockFetcherIterator and BasicBlockFetcherIterator
 */
val itr = blockFetcherItr.flatMap(unpackBlock)
val completionIter = CompletionIterator[T, Iterator[T]](itr, {
val shuffleMetrics = new ShuffleReadMetrics
shuffleMetrics.shuffleFinishTime = System.currentTimeMillis
shuffleMetrics.remoteFetchTime = blockFetcherItr.remoteFetchTime
shuffleMetrics.fetchWaitTime = blockFetcherItr.fetchWaitTime
shuffleMetrics.remoteBytesRead = blockFetcherItr.remoteBytesRead
shuffleMetrics.totalBlocksFetched = blockFetcherItr.totalBlocks
shuffleMetrics.localBlocksFetched = blockFetcherItr.numLocalBlocks
shuffleMetrics.remoteBlocksFetched = blockFetcherItr.numRemoteBlocks
context.taskMetrics.shuffleReadMetrics = Some(shuffleMetrics)
})
new InterruptibleIterator[T](context, completionIter)
}
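The grouping step above (statuses → splitsByAddress → blocksByAddress) can be sketched in isolation. In this hedged sketch, a plain host:port string stands in for BlockManagerId, and the statuses array, shuffleId and reduceId values are made up for illustration:

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

object GroupByAddressSketch {
  // Mirrors the grouping in fetch: statuses(i) is the (address, size) of map
  // task i's output for this reduce partition. The array index doubles as the
  // mapId when the block name is built below.
  def groupByAddress(statuses: Array[(String, Long)], shuffleId: Int, reduceId: Int)
      : Seq[(String, Seq[(String, Long)])] = {
    val splitsByAddress = new HashMap[String, ArrayBuffer[(Int, Long)]]
    for (((address, size), index) <- statuses.zipWithIndex) {
      splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size))
    }
    splitsByAddress.toSeq.map { case (address, splits) =>
      (address, splits.toSeq.map(s => (s"shuffle_${shuffleId}_${s._1}_$reduceId", s._2)))
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up map output locations: map tasks 0 and 2 ran on hostA, task 1 on hostB.
    val statuses = Array(("hostA:7337", 100L), ("hostB:7337", 50L), ("hostA:7337", 200L))
    groupByAddress(statuses, 0, 3).sortBy(_._1).foreach(println)
  }
}
```

Grouping by address first lets getMultiple issue one fetch request per remote node instead of one per block.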
The fetch function involves BlockManager from the storage module, as well as the MapOutputTracker class.
Key variables: statuses, splitsByAddress, blocksByAddress
Key functions: getMultiple, getServerStatuses
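The unpackBlock helper in the code above boils down to flatMapping Option-wrapped iterators: present blocks contribute their elements, a missing block fails the fetch. A minimal standalone sketch (the block ids and contents here are made up, and a RuntimeException stands in for FetchFailedException):

```scala
object UnpackSketch {
  // Mirrors unpackBlock: yield the block's iterator when it was fetched,
  // fail fast when it is missing.
  def unpack[T](pair: (String, Option[Iterator[T]])): Iterator[T] = pair match {
    case (_, Some(it)) => it
    case (id, None)    => throw new RuntimeException("Failed to fetch block " + id)
  }

  def main(args: Array[String]): Unit = {
    // Made-up (blockId, contents) pairs, standing in for what getMultiple returns.
    val fetched = Iterator(
      ("shuffle_0_0_3", Some(Iterator(1, 2))),
      ("shuffle_0_1_3", Some(Iterator(3)))
    )
    // flatMap stitches the per-block iterators into one stream for the reduce task.
    println(fetched.flatMap(unpack).toList) // List(1, 2, 3)
  }
}
```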