Spark RDD and MapReduce: Basic Operations

This article introduces the basic RDD operations in PySpark, including map, reduce, and pair RDDs, and walks through an example of preprocessing text data and counting word frequencies.


from __future__ import print_function, division
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
# Build a local SparkSession with Hive support; getOrCreate() reuses an existing session if one is running
spark = SparkSession.builder.master("local") \
   .appName("test") \
   .enableHiveSupport() \
   .getOrCreate()

# The SparkContext is the entry point for the RDD API
sc = spark.sparkContext

Part 1. Basic RDD Operations

Because an RDD is distributed across the cluster, computation is expressed through map and reduce operations, which is quite different from ordinary single-threaded program logic.

This section introduces the basic map, reduce, and related commands so that you become familiar with operating on RDDs.
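
Before the examples, here is a minimal sketch (using the sc created above and a made-up list of numbers, not from the original) contrasting an ordinary loop with the same computation written as a map/reduce pipeline on an RDD:

nums = [1, 2, 3, 4]

# Single-threaded style: one loop accumulates the result on a single machine
total = 0
for n in nums:
    total += n * n

# RDD style: map describes the per-element transform, reduce describes the aggregation,
# and Spark decides how to distribute the work across partitions
rddTotal = sc.parallelize(nums).map(lambda n: n * n).reduce(lambda a, b: a + b)

print(total, rddTotal)  # both should print 30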

wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
print(type(wordsRDD))
<class 'pyspark.rdd.RDD'>

map (Transform)

map applies a function to every element of the RDD.

def makePlural(word):
    """Adds an 's' to `word`.

    Note:
        This is a simple function that only adds an 's'.  No attempt is made to follow proper
        pluralization rules.

    Args:
        word (str): A string.

    Returns:
        str: A string with 's' added to it.
    """
    return word + 's'

print(makePlural('cat'))
cats
appliedRDD = wordsRDD.map(makePlural)
print(type(appliedRDD))
<class 'pyspark.rdd.PipelinedRDD'>

collect (Action)

Returns the RDD's elements to the driver (master) as a Python list.

print(appliedRDD.collect())
print(type(appliedRDD.collect()))
['cats', 'elephants', 'rats', 'rats', 'cats']
<type 'list'>
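
Because collect() ships every element back to the driver, for large RDDs it is common to inspect only a few elements instead; a minimal sketch using the take action (which also appears later in this tutorial):

# take(n) returns only the first n elements to the driver
print(appliedRDD.take(2))
# expected: ['cats', 'elephants']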

Using a lambda function

A lambda function is an anonymous function: instead of defining a named function separately, the logic is declared inline with lambda.

lambdaRDD = wordsRDD.map(lambda word: word + 's')
print(lambdaRDD.collect())
['cats', 'elephants', 'rats', 'rats', 'cats']

Exercise: compute the length of each word

lenofRDD = (wordsRDD
            .map(lambda word: len(word))
            .collect())
print(lenofRDD)
[3, 8, 3, 3, 3]

Part 2. Computing with Pair RDDs

Pair RDD

A pair RDD is an RDD whose elements are stored as (key, value) pairs.

pairRDD = wordsRDD.map(lambda word: (word, 1))
print(pairRDD.take(1))
print(pairRDD)
[('cat', 1)]
PythonRDD[6] at RDD at PythonRDD.scala:48

groupByKey (Transform)

Groups the elements of the RDD by key.

wordsGrouped = pairRDD.groupByKey()
for key, value in wordsGrouped.collect():
    print('{0}: {1}'.format(key, list(value)))
rat: [1, 1]
elephant: [1]
cat: [1, 1]

Summing the values for each key

wordCountsGrouped = wordsGrouped.map(lambda kv: (kv[0], sum(kv[1])))
wordCountsGrouped.collect()
[('rat', 2), ('elephant', 1), ('cat', 2)]

A better solution!

reduceByKey (Transform)

reduceByKey aggregates the values for each key directly, avoiding the extra shuffle of raw (key, value) pairs that groupByKey incurs.

wordCounts = pairRDD.reduceByKey(lambda a,b: a+b)
print(wordCounts.collect())
[('rat', 2), ('elephant', 1), ('cat', 2)]

Exercise: now you can write word count yourself

wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordRDD = sc.parallelize(wordsList)
wordCountRDD = (wordRDD
                 .map(lambda x: (x, 1))
                 .reduceByKey(lambda x ,y : x + y))
wordCountRDD.collect()
[('rat', 2), ('cat', 2), ('elephant', 1)]

Part 3. Finding Distinct Values

RDDs come with a rich set of ready-made APIs. Check the official documentation for an existing API first, to avoid reinventing the wheel.

distinct (Transform)

uniqueWords = wordsRDD.map(lambda word: (word, 1)).distinct()
print(uniqueWords.collect())
[('elephant', 1), ('cat', 1), ('rat', 1)]

count (Action)

countUniqueWords = wordsRDD.map(lambda word: (word, 1)).distinct().count()
print(countUniqueWords)
3
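
As one more illustration of the ready-made APIs mentioned above, the whole word-count pattern can be collapsed into the built-in countByValue action, which returns the frequency of each element to the driver as a dict; a small sketch using the same wordsRDD:

# countByValue counts how many times each distinct element occurs and returns a dict
print(wordsRDD.countByValue())
# expected contents: {'cat': 2, 'elephant': 1, 'rat': 2} (returned as a defaultdict)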

Exercise: compute the average number of times each word appears

wordsCount = [('cat',2),('elephant',1),('rat',2)]
wordCountRDD = sc.parallelize(wordsCount)
totalCount = (wordCountRDD
                .map(lambda pair: pair[1])
                .reduce(lambda x, y: x + y))
average = totalCount / (wordCountRDD.distinct().count())
print(average)
1.66666666667
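
PySpark RDDs also expose a built-in mean() action, so the same average can be computed without a manual reduce; a minimal sketch assuming the wordCountRDD defined just above:

# mean() averages the values of a numeric RDD directly
print(wordCountRDD.map(lambda pair: pair[1]).mean())
# expected: roughly 1.6667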

Part 4. RDDs Can Also Be Passed as Function Arguments

def wordCount(wordListRDD):
    """Creates a pair RDD with word counts from an RDD of words.

    Args:
        wordListRDD (RDD of str): An RDD consisting of words.

    Returns:
        RDD of (str, int): An RDD consisting of (word, count) tuples.
    """
    return (wordListRDD
           .map(lambda a : (a,1))
           .reduceByKey(lambda a,b: a+b))
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
print(wordCount(wordsRDD).collect())
[('rat', 2), ('elephant', 1), ('cat', 2)]

Part 5. Interacting with Python Libraries

One of the most convenient things about PySpark is that ordinary Python libraries can be imported and used directly inside map or reduce.

from operator import add

totalCount = (wordCounts
              .map(lambda pair: pair[1])
              .reduce(add))

print(totalCount)
5
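
The same applies to other standard-library modules; a small sketch (a hypothetical example, not from the original) using the built-in math module inside map:

import math

# Any importable Python function can serve as the map function
sqrtRDD = sc.parallelize([1, 4, 9, 16]).map(math.sqrt)
print(sqrtRDD.collect())
# expected: [1.0, 2.0, 3.0, 4.0]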

Part 6. Text Mining

To wrap up the material so far, this section walks through a small text-mining example: read an external text file, clean up its contents, and find the most frequently used words.

6.1 Removing punctuation

import re
def removePunctuation(text):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        text (str): A string.

    Returns:
        str: The cleaned up string.
    """
    return re.sub("[^a-zA-Z0-9 ]", "", text.lower()).strip()
print(removePunctuation('Hi, you!'))
print(removePunctuation(' No under_score!'))
hi you
no underscore

6.2 Reading a file from HDFS

!hadoop fs -tail /tmp/shakespear.txt
the tillage of thy husbandry?
  Or who is he so fond will be the tomb,
  Of his self-love to stop posterity?
  Thou art thy mother's glass and she in thee
  Calls back the lovely April of her prime,
  So thou through windows of thine age shalt see,
  Despite of wrinkles this thy golden time.
    But if thou live remembered not to be,
    Die single and thine image dies with thee.


                     4
  Unthrifty loveliness why dost thou spend,
  Upon thy self thy beauty's legacy?
  Nature's bequest gives nothing but doth lend,
  And being frank she lends to those are free:
  Then beauteous niggard why dost thou abuse,
  The bounteous largess given thee to give?
  Profitless usurer why dost thou use
  So great a sum of sums yet canst not live?
  For having traffic with thy self alone,
  Thou of thy self thy sweet self dost deceive,
  Then how when nature calls thee to be gone,
  What acceptable audit canst thou leave?
    Thy unused beauty must be tombed with thee,
    Which used lives th' executor to be.
filepath = "hdfs:///tmp/shakespear.txt"
shakespeareRDD = (sc
                  .textFile(filepath)
                  .map(removePunctuation))
shakespeareRDD.collect()
[u'1609',
 u'',
 u'the sonnets',
 u'',
 u'by william shakespeare',
 u'',
 u'',
 u'',
 u'1',
 u'from fairest creatures we desire increase',
 u'that thereby beautys rose might never die',
 u'but as the riper should by time decease',
 u'his tender heir might bear his memory',
 u'but thou contracted to thine own bright eyes',
 u'feedst thy lights flame with selfsubstantial fuel',
 u'making a famine where abundance lies',
 u'thy self thy foe to thy sweet self too cruel',
 u'thou that art now the worlds fresh ornament',
 u'and only herald to the gaudy spring',
 u'within thine own bud buriest thy content',
 u'and tender churl makst waste in niggarding',
 u'pity the world or else this glutton be',
 u'to eat the worlds due by the grave and thee',
 u'',
 u'',
 u'2',
 u'when forty winters shall besiege thy brow',
 u'and dig deep trenches in thy beautys field',
 u'thy youths proud livery so gazed on now',
 u'will be a tattered weed of small worth held',
 u'then being asked where all thy beauty lies',
 u'where all the treasure of thy lusty days',
 u'to say within thine own deep sunken eyes',
 u'were an alleating shame and thriftless praise',
 u'how much more praise deserved thy beautys use',
 u'if thou couldst answer this fair child of mine',
 u'shall sum my count and make my old excuse',
 u'proving his beauty by succession thine',
 u'this were to be new made when thou art old',
 u'and see thy blood warm when thou feelst it cold',
 u'',
 u'',
 u'3',
 u'look in thy glass and tell the face thou viewest',
 u'now is the time that face should form another',
 u'whose fresh repair if now thou not renewest',
 u'thou dost beguile the world unbless some mother',
 u'for where is she so fair whose uneared womb',
 u'disdains the tillage of thy husbandry',
 u'or who is he so fond will be the tomb',
 u'of his selflove to stop posterity',
 u'thou art thy mothers glass and she in thee',
 u'calls back the lovely april of her prime',
 u'so thou through windows of thine age shalt see',
 u'despite of wrinkles this thy golden time',
 u'but if thou live remembered not to be',
 u'die single and thine image dies with thee',
 u'',
 u'',
 u'4',
 u'unthrifty loveliness why dost thou spend',
 u'upon thy self thy beautys legacy',
 u'natures bequest gives nothing but doth lend',
 u'and being frank she lends to those are free',
 u'then beauteous niggard why dost thou abuse',
 u'the bounteous largess given thee to give',
 u'profitless usurer why dost thou use',
 u'so great a sum of sums yet canst not live',
 u'for having traffic with thy self alone',
 u'thou of thy self thy sweet self dost deceive',
 u'then how when nature calls thee to be gone',
 u'what acceptable audit canst thou leave',
 u'thy unused beauty must be tombed with thee',
 u'which used lives th executor to be']
print('\n'.join(shakespeareRDD
                .zipWithIndex()  # to (line, lineNum)
                .map(lambda pair: '{0}: {1}'.format(pair[1], pair[0]))  # to 'lineNum: line'
                .take(15)))
0: 1609
1: 
2: the sonnets
3: 
4: by william shakespeare
5: 
6: 
7: 
8: 1
9: from fairest creatures we desire increase
10: that thereby beautys rose might never die
11: but as the riper should by time decease
12: his tender heir might bear his memory
13: but thou contracted to thine own bright eyes
14: feedst thy lights flame with selfsubstantial fuel

6.3 Splitting lines into words

shakespeareWordsRDD = shakespeareRDD.flatMap(lambda a: a.split(" "))
shakespeareWordCount = shakespeareWordsRDD.count()
print(shakespeareWordsRDD.top(5))
print(shakespeareWordCount)
[u'youths', u'yet', u'wrinkles', u'worth', u'worlds']
461
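
The choice of flatMap matters here: map would produce one list of words per line, while flatMap flattens those lists into a single RDD of words; a quick sketch for comparison:

# map keeps one list per line; flatMap yields the individual words
nestedRDD = shakespeareRDD.map(lambda line: line.split(" "))
flatRDD = shakespeareRDD.flatMap(lambda line: line.split(" "))
print(nestedRDD.take(2))  # e.g. [['1609'], ['']]
print(flatRDD.take(2))    # e.g. ['1609', '']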

6.4 Removing empty strings

shakeWordsRDD = shakespeareWordsRDD.filter(lambda word: len(word) > 0)
shakeWordCount = shakeWordsRDD.count()
print(shakeWordCount)
450

6.5 Counting words and sorting

top15WordsAndCounts = wordCount(shakeWordsRDD).takeOrdered(15, lambda pair: -pair[1])
print('\n'.join(map(lambda pair: '{0}: {1}'.format(pair[0], pair[1]), top15WordsAndCounts)))
thy: 22
thou: 16
the: 15
to: 12
and: 11
of: 10
be: 8
thee: 6
thine: 6
self: 6
dost: 5
so: 5
his: 4
now: 4
where: 4
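
Putting the steps of Part 6 together, here is a minimal end-to-end sketch (assuming the same HDFS path and the removePunctuation helper defined above):

top15 = (sc.textFile("hdfs:///tmp/shakespear.txt")
         .map(removePunctuation)                       # 6.1 clean each line
         .flatMap(lambda line: line.split(" "))        # 6.3 split into words
         .filter(lambda word: len(word) > 0)           # 6.4 drop empty strings
         .map(lambda word: (word, 1))                  # build (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)              # 6.5 count each word
         .takeOrdered(15, key=lambda pair: -pair[1]))  # top 15 by count
print(top15)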