下面的程序运行一个简单的字数计数来测试spark结构化流。我在终端上写单词,然后在另一个终端上运行程序。写下这些字后,需要15-20秒才能在第二个终端上显示输出。有没有办法我可以减少输出时间,因为它很长。有人请帮帮我
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, " ")
).alias("word")
)
# Generate running word count
wordCounts = words.groupBy("word").count()
query = wordCounts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
query.awaitTermination()
Terminal where I am connecting to port and writing the words
C:\Program Files (x86)\Nmap>ncat -lvp 9999
Ncat: Version 7.80 ( https://nmap.org/ncat )
Ncat: Listening on :::9999
Ncat: Listening on 0.0.0.0:9999
Ncat: Connection from 127.0.0.1.
Ncat: Connection from 127.0.0.1:44577.
apacheapaop
apache
spark
apache
hadoop
hello
world
hello
hello guys guys
hello
Output terminal where I am counting words
Batch: 2
-------------------------------------------
+-----------+-----+
| word|count|
+-----------+-----+
|apacheapaop| 1|
| hello| 1|
| apache| 2|
| spark| 1|
| | 5|
| hadoop| 1|
+-----------+-----+
-------------------------------------------
Batch: 3
-------------------------------------------
+-----------+-----+
| word|count|
+-----------+-----+
|apacheapaop| 1|
| hello| 2|
| apache| 2|
| spark| 1|
| world| 1|
| | 5|
| hadoop| 1|
+-----------+-----+
-------------------------------------------
Batch: 4
-------------------------------------------
+-----------+-----+
| word|count|
+-----------+-----+
| guys| 2|
|apacheapaop| 1|
| hello| 3|
| apache| 2|
| spark| 1|
| world| 1|
| | 6|
| hadoop| 1|
+-----------+-----+
-------------------------------------------
Batch: 5
-------------------------------------------
+-----------+-----+
| word|count|
+-----------+-----+
| guys| 2|
|apacheapaop| 1|
| hello| 4|
| apache| 2|
| spark| 1|
| world| 1|
| | 6|
| hadoop| 1|
+-----------+-----+
接收终端上的输出(每批)需要15-20秒…如何减少延迟
暂无答案!
目前还没有任何答案,快来回答吧!