Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
>>> rdd = sc.parallelize({('panda', 0), ('pink', 3), ('pirate', 3), ('panda', 1), ('pink', 4)})  # a Python set, so element order is not preserved (and duplicate pairs would collapse)
>>> rdd.collect()
[('panda', 1), ('pink', 3), ('pirate', 3), ('panda', 0), ('pink', 4)]
>>>
>>> rdd.mapValues(lambda x: (x, 1)).collect()
[('panda', (1, 1)), ('pink', (3, 1)), ('pirate', (3, 1)), ('panda', (0, 1)), ('pink', (4, 1))]
>>> nums=rdd
>>> nums.collect()
[('panda', 1), ('pink', 3), ('pirate', 3), ('panda', 0), ('pink', 4)]
>>>
>>> sumCount = nums.combineByKey(
...     (lambda x: (x, 1)),                          # createCombiner: first value seen for a key becomes (sum, count)
...     (lambda acc, y: (acc[0] + y, acc[1] + 1)),   # mergeValue: fold another value into the per-partition (sum, count)
...     (lambda a, b: (a[0] + b[0], a[1] + b[1])))   # mergeCombiners: merge (sum, count) pairs across partitions
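The session never materializes sumCount. A minimal sketch of the usual follow-up, dividing sum by count to get the per-key average (float() guards against integer division under Python 2; collect() order may vary):
>>> sumCount.mapValues(lambda p: p[0] / float(p[1])).collect()
[('pirate', 3.0), ('panda', 0.5), ('pink', 3.5)]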
>>>
>>> nums.mapValues(lambda x: (x, 1)).collect()  # a rather clever idiom: pair each value with a count of 1
[('panda', (1, 1)), ('pink', (3, 1)), ('pirate', (3, 1)), ('panda', (0, 1)), ('pink', (4, 1))]
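Pairing each value with a 1 is also exactly what a reduceByKey-based average needs; a sketch of the equivalent pipeline without combineByKey (same hedges as above: float() for Python 2 division, result order may vary):
>>> nums.mapValues(lambda x: (x, 1)) \
...     .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
...     .mapValues(lambda p: p[0] / float(p[1])) \
...     .collect()
[('pirate', 3.0), ('panda', 0.5), ('pink', 3.5)]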