Spark Source Code Analysis: ShuffleFetcher

Let's first look at what a shuffleBlockId looks like, for example shuffle_0_3_3. How is this shuffleBlockId composed?

A shuffleBlockId is made up of shuffleId + mapId + reduceId, so in this example the shuffleId is 0, the mapId is 3 and the reduceId is 3.
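As a minimal sketch of how such an id could be assembled (the real class is ShuffleBlockId in the Spark source; SimpleShuffleBlockId below is just an illustrative stand-in):

case class SimpleShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) {
  // mirrors the shuffle_<shuffleId>_<mapId>_<reduceId> naming scheme described above
  def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
}

object SimpleShuffleBlockIdDemo {
  def main(args: Array[String]): Unit = {
    println(SimpleShuffleBlockId(0, 3, 3).name) // prints shuffle_0_3_3
  }
}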

Now let's look at ShuffleFetcher. This class is used by shuffle-dependent RDDs. The abstract class currently has only one implementation, and its most important method is fetch. fetch returns an iterator which, as you would expect, covers every block to be pulled together with the node that holds it. Let's now walk through the fetch function.

Inputs of fetch: shuffleId, reduceId, context, serializer

Output of fetch: an iterator over the blocks that have been fetched

override def fetch[T](
      shuffleId: Int,
      reduceId: Int,
      context: TaskContext,
      serializer: Serializer)
    : Iterator[T] =
  {

    logDebug("Fetching outputs for shuffle %d, reduce %d".format(shuffleId, reduceId))
    val blockManager = SparkEnv.get.blockManager
    /* blockManager on this node is later used to call getMultiple, which pulls the blocks belonging to this reduce task from the hosts that hold them */

    val startTime = System.currentTimeMillis
    val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)
    /* statuses is an Array of (BlockManagerId, Long) pairs: the block manager holding each map output and the size of the output destined for this reduce */
    logDebug("Fetching map output location for shuffle %d, reduce %d took %d ms".format(
      shuffleId, reduceId, System.currentTimeMillis - startTime))
    /* SparkEnv.get above returns the SparkEnv singleton; in C++ the equivalent would look like sparkEnv.getInstance() */

    val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]
    /* group the (mapId, size) pairs by the BlockManagerId of the node that holds
     * each map output; a standalone sketch of this grouping follows the listing */
    for (((address, size), index) <- statuses.zipWithIndex) {
      splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size))
    }

    val blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])] = splitsByAddress.toSeq.map {
      case (address, splits) =>
        (address, splits.map(s => (ShuffleBlockId(shuffleId, s._1, reduceId), s._2)))
        /* Note that a ShuffleBlockId is built here, not a plain BlockId.
         * splits is a buffer of pairs; for each element s:
         * s._1 is the index of the map output on that node, i.e. the mapId
         * s._2 is the size of that map output
         */
    }

    def unpackBlock(blockPair: (BlockId, Option[Iterator[Any]])) : Iterator[T] = {
      val blockId = blockPair._1
      val blockOption = blockPair._2
      blockOption match {
        case Some(block) => {
          block.asInstanceOf[Iterator[T]]
        }
        case None => {
          blockId match {
            case ShuffleBlockId(shufId, mapId, _) =>
              val address = statuses(mapId.toInt)._1
              throw new FetchFailedException(address, shufId.toInt, mapId.toInt, reduceId, null)
            case _ =>
              throw new SparkException(
                "Failed to get block " + blockId + ", which is not a shuffle block")
          }
        }
      }
    }

    val blockFetcherItr = blockManager.getMultiple(blocksByAddress, serializer)
    /* getMultiple returns a BlockFetcherIterator; this iterator has two
     * implementations: NettyBlockFetcherIterator and BasicBlockFetcherIterator
     */
    val itr = blockFetcherItr.flatMap(unpackBlock)

    val completionIter = CompletionIterator[T, Iterator[T]](itr, {
      val shuffleMetrics = new ShuffleReadMetrics
      shuffleMetrics.shuffleFinishTime = System.currentTimeMillis
      shuffleMetrics.remoteFetchTime = blockFetcherItr.remoteFetchTime
      shuffleMetrics.fetchWaitTime = blockFetcherItr.fetchWaitTime
      shuffleMetrics.remoteBytesRead = blockFetcherItr.remoteBytesRead
      shuffleMetrics.totalBlocksFetched = blockFetcherItr.totalBlocks
      shuffleMetrics.localBlocksFetched = blockFetcherItr.numLocalBlocks
      shuffleMetrics.remoteBlocksFetched = blockFetcherItr.numRemoteBlocks
      context.taskMetrics.shuffleReadMetrics = Some(shuffleMetrics)
    })

    new InterruptibleIterator[T](context, completionIter)
  }
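
To make the splitsByAddress / blocksByAddress transformation above concrete, here is a small self-contained sketch of the same grouping using plain Scala collections. A String host name stands in for BlockManagerId, and the sample sizes are made up purely for illustration:

import scala.collection.mutable.{ArrayBuffer, HashMap}

object GroupByAddressSketch {
  def main(args: Array[String]): Unit = {
    // statuses(mapId) = (host holding map output mapId, size of that output for this reduce);
    // the String host stands in for BlockManagerId in this sketch
    val statuses: Array[(String, Long)] =
      Array(("host-A", 100L), ("host-B", 50L), ("host-A", 70L))

    // same grouping as in fetch: the index into statuses is the mapId
    val splitsByAddress = new HashMap[String, ArrayBuffer[(Int, Long)]]
    for (((address, size), mapId) <- statuses.zipWithIndex) {
      splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((mapId, size))
    }

    // turn each (mapId, size) pair into a (shuffle_<shuffleId>_<mapId>_<reduceId>, size) pair
    val shuffleId = 0
    val reduceId = 3
    val blocksByAddress: Seq[(String, Seq[(String, Long)])] = splitsByAddress.toSeq.map {
      case (address, splits) =>
        (address, splits.map(s => (s"shuffle_${shuffleId}_${s._1}_${reduceId}", s._2)))
    }

    blocksByAddress.foreach(println)
    // e.g. (host-A,ArrayBuffer((shuffle_0_0_3,100), (shuffle_0_2_3,70)))
    //      (host-B,ArrayBuffer((shuffle_0_1_3,50)))
  }
}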

The fetch function involves the BlockManager from the storage module, as well as the MapOutputTracker class.

Key variables: statuses, blocksByAddress, splitsByAddress

Key functions: getMultiple, getServerStatuses
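
The metrics bookkeeping at the end of fetch relies on the CompletionIterator pattern: wrap an iterator and run a callback once it is fully consumed. The sketch below shows only that pattern, not Spark's actual CompletionIterator class:

// minimal sketch of the completion-callback pattern used by CompletionIterator:
// delegate to an underlying iterator and run `completion` exactly once when it is drained
class SimpleCompletionIterator[A](sub: Iterator[A], completion: () => Unit) extends Iterator[A] {
  private var completed = false

  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) {
      completed = true
      completion()
    }
    more
  }

  override def next(): A = sub.next()
}

object SimpleCompletionIteratorDemo {
  def main(args: Array[String]): Unit = {
    val itr = new SimpleCompletionIterator[Int](Iterator(1, 2, 3),
      () => println("all blocks consumed, record shuffle read metrics here"))
    itr.foreach(println) // prints 1, 2, 3, then the completion message
  }
}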

