Spark Source Code Analysis

xiaoyuyulala

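License: CC 4.0 BY-SA

Category: Big Data Components. Tags: spark, big data, source code

Article link: https://blog.csdn.net/qq_42192672/article/details/115089282

Summary: This article covers core concepts used by Spark, including Scala's Iterator and Option types. It touches on advanced Spark topics such as YARN mode execution, Master & Worker, job execution, and data skew. It also walks through Spark operators such as map, flatMap, and filter with usage examples, focusing on the source code and execution of mapPartitions, groupByKey, ShuffledRDD, and reduceByKey.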

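Source of this material: the Lagou Education big data training camp. There is still a pile of things I have not learned; I hope to fill the gaps bit by bit, even though I am a chronic procrastinator.

Basic Concepts

Notes on concepts that confused me while crash-learning Scala and Spark.

Scala Iterator

Overview

A Scala Iterator is not a collection; it is a way to access the elements of a collection one at a time.
The two basic operations on an iterator it are next and hasNext.
Calling it.next() returns the iterator's next element and advances the iterator's state.
Calling it.hasNext (no parentheses in Scala) checks whether there are any elements left.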
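To make the two operations concrete, here is a minimal sketch in plain Scala (no Spark needed). The object name IteratorDemo and the helper drain are made up for this illustration; only hasNext and next() come from the standard library.

```scala
import scala.collection.mutable.ListBuffer

object IteratorDemo {
  // Walk an iterator by hand, using only hasNext and next().
  def drain(it: Iterator[Int]): List[Int] = {
    val seen = ListBuffer[Int]()
    while (it.hasNext) {   // hasNext is declared without parentheses
      seen += it.next()    // next() returns an element and advances the iterator
    }
    seen.toList
  }

  def main(args: Array[String]): Unit = {
    val it = List(1, 2, 3).iterator
    println(drain(it))    // List(1, 2, 3)
    println(it.hasNext)   // false: the iterator is used up after one pass
  }
}
```

The second println is the point to remember: an iterator can only be traversed once, which matters later when a Spark operator hands you a partition as an Iterator.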

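Scala's Option Type

Option has two subtypes, Some and None. When a function returns Some, it succeeded and wrapped a value for you (a String in the example below), which you can retrieve with get. When it returns None, there is no value to give you, and calling get on it throws NoSuchElementException.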
scala> val capitals = Map("1"->"Paris", "2"->"Tokyo", "3"->"Beijing")
scala> capitals.get("1")
res0: Option[String] = Some(Paris)
scala> capitals.get("8")
res1: Option[String] = None
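Since get fails on None, it is usually safer to handle both cases explicitly. The sketch below reuses the capitals map from the REPL session; the object name OptionDemo and the helper describe are invented for the example, while getOrElse and pattern matching on Some/None are standard Scala.

```scala
object OptionDemo {
  val capitals: Map[String, String] =
    Map("1" -> "Paris", "2" -> "Tokyo", "3" -> "Beijing")

  // Pattern matching forces us to say what happens in both cases.
  def describe(key: String): String = capitals.get(key) match {
    case Some(city) => s"found $city"
    case None       => "not found"
  }

  def main(args: Array[String]): Unit = {
    println(describe("1"))                            // found Paris
    println(describe("8"))                            // not found
    println(capitals.get("8").getOrElse("unknown"))   // unknown
  }
}
```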

Advanced Concepts

YARN Mode Execution (in progress)

Master & Worker (in progress)

Job Execution (in progress)

Shuffle in Detail (in progress)

Data Skew (in progress)

Operators

map

/**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

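The key line is new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))

withScope exists for the web UI: it records the scope of the operation so that the RDDs created inside it are grouped together in the DAG visualization. sc.clean runs the closure cleaner on f, which strips references the closure does not actually need and checks that the result is serializable, so the task does not fail when it is serialized and shipped to the executors.

U and T are both type parameters: the input element type is T and the output element type is U.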

/**
 * An RDD that applies the provided function to every partition of the parent RDD.
 *
 * @param prev the parent RDD.
 * @param f The function used to map a tuple of (TaskContext, partition index, input iterator) to
 *          an output iterator.
 * @param preservesPartitioning Whether the input function preserves the partitioner, which should
 *                              be `false` unless `prev` is a pair RDD and the input function
 *                              doesn't modify the keys.
 * @param isFromBarrier Indicates whether this RDD is transformed from an RDDBarrier, a stage
 *                      containing at least one RDDBarrier shall be turned into a barrier stage.
 * @param isOrderSensitive whether or not the function is order-sensitive. If it's order
 *                         sensitive, it may return totally different result when the input order
 *                         is changed. Mostly stateful functions are order-sensitive.
 */
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    // The Int argument (pid at the call sites) is the partition index; TaskContext
    // carries information about the environment the task is running in.
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false,
    isFromBarrier: Boolean = false,
    isOrderSensitive: Boolean = false)
  extends RDD[U](prev) {

  // Reuse the parent RDD's partitioner when preservesPartitioning is set, otherwise None.
  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // compute applies the supplied function to the parent RDD's iterator for this partition.
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }

  @transient protected lazy override val isBarrier_ : Boolean =
    isFromBarrier || dependencies.exists(_.rdd.isBarrier())

  override protected def getOutputDeterministicLevel = {
    if (isOrderSensitive && prev.outputDeterministicLevel == DeterministicLevel.UNORDERED) {
      DeterministicLevel.INDETERMINATE
    } else {
      super.getOutputDeterministicLevel
    }
  }
}

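I have not read much source code, so at first glance this was a bit confusing. I read the doc comments first and then added my own comments.

As you can see, compute is what actually runs: it applies the function we supplied to each partition of the RDD, and the partition's data is exposed through an iterator.

Let's also look at what firstParent actually refers to.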

/** Returns the first parent RDD */
  protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
    dependencies.head.rdd.asInstanceOf[RDD[U]]
  }

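It simply returns the first parent RDD. My understanding is that the parent of a MapPartitionsRDD is the RDD on which map was called, so both the partitions and the partitioner are looked up on that original RDD. You can see that the first constructor parameter is exactly var prev: RDD[T].

The most important takeaway for me is getting to know MapPartitionsRDD: it is a kind of RDD, and the new RDD produced by a map operation, for example, is of this type.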
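The relationship between map, MapPartitionsRDD, and compute can be sketched with a toy model in plain Scala. Everything here (MiniRDD, MiniSource, MiniMapPartitions) is a hypothetical stand-in written for this note, not Spark code; it leaves out TaskContext, dependencies, and scheduling, and only mirrors the shape of the classes quoted above.

```scala
object MiniRddDemo {
  // Hypothetical toy stand-in for RDD: a number of partitions plus a compute method.
  abstract class MiniRDD[T] {
    def numPartitions: Int
    def compute(index: Int): Iterator[T]

    // Same shape as RDD.map: wrap this node in a new node, run nothing yet.
    def map[U](f: T => U): MiniRDD[U] =
      new MiniMapPartitions[U, T](this, (_, iter) => iter.map(f))

    // Stand-in for an action: only here are the partitions actually computed.
    def collect(): List[T] =
      (0 until numPartitions).flatMap(i => compute(i)).toList
  }

  // Leaf node that owns the data, one List per partition.
  class MiniSource[T](parts: Vector[List[T]]) extends MiniRDD[T] {
    def numPartitions: Int = parts.length
    def compute(index: Int): Iterator[T] = parts(index).iterator
  }

  // Toy counterpart of MapPartitionsRDD: partitions come from the parent,
  // and compute applies f to the parent's iterator for that partition.
  class MiniMapPartitions[U, T](prev: MiniRDD[T], f: (Int, Iterator[T]) => Iterator[U])
      extends MiniRDD[U] {
    def numPartitions: Int = prev.numPartitions
    def compute(index: Int): Iterator[U] = f(index, prev.compute(index))
  }

  def main(args: Array[String]): Unit = {
    val source = new MiniSource(Vector(List(1, 2, 3), List(4, 5)))
    println(source.map(_ + 1).map(_ * 10).collect())   // List(20, 30, 40, 50, 60)
  }
}
```

Two chained map calls build two nested MiniMapPartitions nodes; calling collect pulls elements through both iterators one at a time, which is roughly how a chain of narrow transformations behaves within one Spark task.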

flatMap

/**
 *  Return a new RDD by first applying a function to all elements of this
 *  RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

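I like to read the name literally as "map first, then flatten". Overall the implementation looks the same as map, with iter.map replaced by iter.flatMap.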
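The "map first, then flatten" reading can be checked on ordinary Scala collections. This is a small sketch with made-up names (FlatMapDemo, mapThenFlatten, viaFlatMap); it shows the collection-level behaviour only, not the Spark operator itself.

```scala
object FlatMapDemo {
  // Two explicit steps: map each line to its words, then flatten the nested lists.
  def mapThenFlatten(lines: List[String]): List[String] =
    lines.map(_.split(" ").toList).flatten

  // One step on an iterator, the same call that flatMap makes per partition.
  def viaFlatMap(lines: List[String]): List[String] =
    lines.iterator.flatMap(_.split(" ").toList).toList

  def main(args: Array[String]): Unit = {
    val lines = List("hello spark", "hello scala")
    println(mapThenFlatten(lines))   // List(hello, spark, hello, scala)
    println(viaFlatMap(lines))       // List(hello, spark, hello, scala)
  }
}
```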

filter

/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}

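The one difference from the previous two operators is that filter passes preservesPartitioning = true. Filtering can drop elements but never changes them, so keys stay where the parent's partitioner put them and the partitioner can be kept.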
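Why filtering is safe for the partitioner while an arbitrary map is not can be illustrated without Spark. The sketch below assumes a simplified partitioner (key modulo 3 on non-negative Int keys); partitionOf, partitionByKey, and layoutMatchesPartitioner are hypothetical helpers written for this note.

```scala
object PreservesPartitioningDemo {
  val numPartitions = 3

  // Hypothetical stand-in for a hash partitioner on non-negative Int keys.
  def partitionOf(key: Int): Int = key % numPartitions

  // Place each (key, value) pair into the partition chosen by partitionOf.
  def partitionByKey(data: Seq[(Int, String)]): Vector[List[(Int, String)]] =
    Vector.tabulate(numPartitions)(p => data.filter(kv => partitionOf(kv._1) == p).toList)

  // True when every pair still sits in the partition its key maps to.
  def layoutMatchesPartitioner(parts: Vector[List[(Int, String)]]): Boolean =
    parts.zipWithIndex.forall { case (part, p) =>
      part.forall(kv => partitionOf(kv._1) == p)
    }

  def main(args: Array[String]): Unit = {
    val parts = partitionByKey((0 to 8).map(k => (k, s"v$k")))
    val filtered = parts.map(_.filter(_._1 % 2 == 0))               // keys untouched
    val rekeyed  = parts.map(_.map { case (k, v) => (k + 1, v) })   // keys changed
    println(layoutMatchesPartitioner(filtered))   // true
    println(layoutMatchesPartitioner(rekeyed))    // false
  }
}
```

After the filter, every surviving pair is still in the partition its key hashes to, so claiming the old partitioner remains truthful. After the re-keying map it is not, which is why preservesPartitioning should stay false whenever the function may modify keys.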

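Example 1:

Requirement 1: without using the map operator, add 1 to each element of rdd(1 to 10), then sum the results.

Requirement 2: without using the map operator, add 1 to each element of rdd(1 to 10), then concatenate the results as strings.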

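// MapPartitionsRDD is private[spark], so this code has to be compiled inside the
// org.apache.spark package tree. The two chained package clauses place it in
// org.apache.spark.rdd.com.xiaoyuyu.test and keep RDD and MapPartitionsRDD in scope.
package org.apache.spark.rdd

package com.xiaoyuyu.test

import org.apache.spark.{SparkConf, SparkContext}

/**
 * @Description: Source code analysis
 * @Author: Xiaoyuyu
 * @CreateDate: 2021/3/11 7:45 PM
 */
object Test {

  def main(args: Array[String]): Unit = {

    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getCanonicalName).setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    val arr: Array[Int] = (1 to 10).toArray
    val value: RDD[Int] = sc.makeRDD(arr)

    // Requirement 1: without map, add 1 to each element of rdd(1 to 10), then sum
    val value1 = new MapPartitionsRDD[Int, Int](value, (_, _, iter) => iter.map(_ + 1))
    value1.foreach(println(_))
    println(value1.reduce(_ + _)) // sum of 2 to 11

    // Requirement 2: without map, add 1 to each element, then concatenate as strings.
    // String concatenation is not commutative, so with several partitions the
    // order of the pieces in the result can differ between runs.
    val str: String = new MapPartitionsRDD[String, Int](value, (_, _, iter) => iter.map(x => (x + 1).toString))
      .reduce(_ + _)
    println(str)
    sc.stop()
  }
}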

Example 2

Requirement: without using the filter operator, implement filter, either dropping or keeping the parent RDD's partitioner.

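    // Requirement: implement filter without the filter operator.
    // preservesPartitioning = false drops the parent RDD's partitioner; pass true to keep it.
    val value2 = new MapPartitionsRDD[Int, Int](value,
      (_, _, iter) => iter.filter(_ > 2),
      preservesPartitioning = false)
    value2.foreach(println(_))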

mapPartitions

As a beginner I have not used this one much; I only know what it does. Let's first look at how it is used.

/**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

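preservesPartitioning defaults to false here and, as the doc comment says, should only be set to true when this is a pair RDD and the function does not modify the keys. Underneath it is still a MapPartitionsRDD, and the function f both takes and returns an Iterator.

The line (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter) differs from the previous operators: f is applied once to the whole partition iterator rather than to each element. I will not dig deeper for now, but at least we know how to call it. An example: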

val value3: RDD[Int] = value.mapPartitions(
  x => {
    val x1: Iterator[Int] = x
    val ints: Iterator[Int] = x1.map(_ + 1)
    ints
  }
)

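I annotated the types on purpose: each partition is handed to the function as an Iterator[T].

cleanedF(iter) => the function is applied to the partition's iterator => how mapPartitions is implemented
iter.map(cleanF) => the function is applied to each element => how map is implemented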
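The practical consequence of that difference is how often the user function is invoked. The sketch below models two partitions as plain lists and counts the calls; mapStyle, mapPartitionsStyle, and callCounts are hypothetical helpers for this note and only imitate the calling convention of the two operators.

```scala
object MapVsMapPartitionsDemo {
  // Two partitions, modelled as plain lists.
  val partitions: Vector[List[Int]] = Vector(List(1, 2, 3), List(4, 5))

  // map style: the function sees one element at a time.
  def mapStyle(f: Int => Int): Vector[List[Int]] =
    partitions.map(part => part.iterator.map(f).toList)

  // mapPartitions style: the function sees a whole partition's iterator, once.
  def mapPartitionsStyle(f: Iterator[Int] => Iterator[Int]): Vector[List[Int]] =
    partitions.map(part => f(part.iterator).toList)

  // Count how many times the user function is called in each style.
  def callCounts(): (Int, Int) = {
    var perElement = 0
    var perPartition = 0
    mapStyle { x => perElement += 1; x + 1 }
    mapPartitionsStyle { iter => perPartition += 1; iter.map(_ + 1) }
    (perElement, perPartition)
  }

  def main(args: Array[String]): Unit = {
    println(mapStyle(_ + 1))                  // Vector(List(2, 3, 4), List(5, 6))
    println(mapPartitionsStyle(_.map(_ + 1))) // Vector(List(2, 3, 4), List(5, 6))
    println(callCounts())                     // (5,2)
  }
}
```

Both styles produce the same data, but the per-partition function runs twice instead of five times. That is why mapPartitions is the usual place for setup work that should happen once per partition, such as opening a connection.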

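The RDD class also contains several related internal functions (for example mapPartitionsInternal and mapPartitionsWithIndexInternal) that user code cannot call because they are private[spark]. The difference is that they do not clean the closure. It is enough to know they exist.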

Example 1

Finally, implement "add one to every element" myself without using mapPartitions.

import scala.collection.mutable.ListBuffer

val value4 = new MapPartitionsRDD[Int, Int](
  value,
  (_, _, iter) => {
    // The shortest form would be iter.map(_ + 1); the loop below walks the
    // iterator by hand to show what that amounts to.
    // (An earlier attempt built a new RDD from iter inside this function, but
    // RDDs cannot be created or used inside a task, so it was dropped.)
    val ints: ListBuffer[Int] = ListBuffer[Int]()
    while (iter.hasNext) {
      val i: Int = iter.next()
      ints += (i + 1)
    }
    ints.iterator
  },
  false)

value4.foreach(println(_))

mapPartitionsWithIndex

As a beginner, I have to admit that I at least knew what mapPartitions does, but I really did not know how to use this operator.

/**
   * Return a new RDD by applying a function to each partition of this RDD, while tracking the index
   * of the original partition.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }

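preservesPartitioning again defaults to false, with the same rule about when it may be set to true. The only addition to the doc comment is the phrase "while tracking the index of the original partition."

cleanedF(index, iter) is also different here: this is where the partition index is passed through to our function.

mapPartitionsWithIndex gives you both the partition's iterator and the partition index.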

def func(index: Int, iterator: Iterator[Int]): Iterator[Int] = {
  println(s"index:$index")
  val list: List[Int] = iterator.toList
  list.foreach(
    x => {
      val string: String = x.toString
      print(string + "\t")
    }
  )
  println()
  list.iterator
}

val value5: RDD[Int] = value.mapPartitionsWithIndex(func)
value5.foreach(println(_))
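To see what the index buys you without starting a Spark job, here is a plain Scala sketch that imitates the calling convention. The helper mapPartitionsWithIndex below is a hypothetical stand-in over in-memory partitions, not the Spark method, and the partition layout is made up.

```scala
object WithIndexDemo {
  // Hypothetical stand-in: call f once per partition with (index, iterator).
  def mapPartitionsWithIndex[T, U](partitions: Vector[List[T]])(
      f: (Int, Iterator[T]) => Iterator[U]): Vector[List[U]] =
    partitions.zipWithIndex.map { case (part, index) => f(index, part.iterator).toList }

  def main(args: Array[String]): Unit = {
    val partitions = Vector(List(1, 2, 3), List(4, 5))
    // Tag every element with the index of the partition it lives in.
    val tagged = mapPartitionsWithIndex(partitions)((index, iter) => iter.map(x => s"p$index:$x"))
    println(tagged.flatten)   // Vector(p0:1, p0:2, p0:3, p1:4, p1:5)
  }
}
```

Tagging elements with their partition index like this is a handy way to inspect how data is distributed, which is essentially what func above does with println.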