I. Basic input sources
1. File input stream
(1) Interactive shell (pyspark)
Open a terminal window (terminal 1) and, from any directory, start pyspark:
pyspark
(The code is the same as in the standalone program in (2) below; a sketch of the interactive session follows.)
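A minimal sketch of what to type at the pyspark prompt, assuming the same logfile path as the standalone program below (in pyspark a SparkContext is already available as sc):
from operator import add
from pyspark.streaming import StreamingContext
# pyspark already provides the SparkContext as `sc`
ssc = StreamingContext(sc, 10)  # batch interval of 10 seconds
lines = ssc.textFileStream('file:///home/zzp/PycharmProjects/streaming/logfile')
words = lines.flatMap(lambda line: line.split(' '))
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(add)
wordCounts.pprint()
ssc.start()
ssc.awaitTermination()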
The monitoring program only watches for files added under the …/streaming/logfile directory after it starts; files that already existed there are not processed.
Create a new log file under logfile: open another terminal window (terminal 2) and create a new file log1.txt in that directory (copying and pasting an existing file will not trigger the stream):
vim log1.txt
The word-count results appear in terminal window 1, where pyspark is running.
(2) Standalone program
Create TestStreaming.py with the code below, then run it:
python3 TestStreaming.py
from operator import add
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext

conf = SparkConf()
conf.setAppName('TestDStream')
conf.setMaster('local[2]')  # run locally with two threads
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # batch interval of 10 seconds

# Watch the directory for newly created files and count the words in them
lines = ssc.textFileStream('file:///home/zzp/PycharmProjects/streaming/logfile')
words = lines.flatMap(lambda line: line.split(' '))
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(add)
wordCounts.pprint()  # print a sample of each batch's counts

ssc.start()             # start the streaming computation
ssc.awaitTermination()  # block until the job is stopped
Open another terminal window (terminal 2) and create a new log1.txt file in the logfile directory (create it with vim); the counts are printed in the terminal running the program.
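For each 10-second batch, pprint writes a block like the following to the terminal running the program. The timestamp and counts below are purely illustrative, assuming log1.txt contained the words "hello spark hello":
-------------------------------------------
Time: 2024-01-01 10:00:10
-------------------------------------------
('hello', 2)
('spark', 1)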
2. Socket stream
Create a code file NetworkWordCount.py.
Open terminal window 1 in the file's directory and start a netcat listener (-l listens, -k keeps listening after each connection closes; sudo is not needed for a port above 1024):
nc -lk 9999
Open terminal window 2 in the same directory and run the Python file:
python3 NetworkWordCount.py localhost 9999
# /usr/local/spark/mycode/streaming/socket/NetworkWordCount.py
from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: NetworkWordCount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)  # batch interval of 1 second
    # Receive lines of text from the given host:port
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
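Once both windows are running, type some words into the nc window (terminal 1) and press Enter; the word counts for each one-second batch are printed by pprint in terminal 2.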