ApacheSpark结构化流媒体打印输出耗时长的字数统计示例

inn6fuwd  于 2021-05-27  发布在  Spark
关注(0)|答案(0)|浏览(195)

下面的程序运行一个简单的字数计数来测试spark结构化流。我在终端上写单词,然后在另一个终端上运行程序。写下这些字后,需要15-20秒才能在第二个终端上显示输出。有没有办法我可以减少输出时间,因为它很长。有人请帮帮我

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words

words = lines.select(
   explode(
       split(lines.value, " ")
   ).alias("word")
)

# Generate running word count

wordCounts = words.groupBy("word").count()

query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

Terminal where I am connecting to port and writing the words
C:\Program Files (x86)\Nmap>ncat -lvp 9999
Ncat: Version 7.80 ( https://nmap.org/ncat )
Ncat: Listening on :::9999
Ncat: Listening on 0.0.0.0:9999
Ncat: Connection from 127.0.0.1.
Ncat: Connection from 127.0.0.1:44577.
apacheapaop

apache
spark
apache
hadoop
hello
world
hello
hello guys guys
hello
Output terminal where I am counting words
Batch: 2
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|apacheapaop|    1|
|      hello|    1|
|     apache|    2|
|      spark|    1|
|           |    5|
|     hadoop|    1|
+-----------+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|apacheapaop|    1|
|      hello|    2|
|     apache|    2|
|      spark|    1|
|      world|    1|
|           |    5|
|     hadoop|    1|
+-----------+-----+

-------------------------------------------
Batch: 4
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|       guys|    2|
|apacheapaop|    1|
|      hello|    3|
|     apache|    2|
|      spark|    1|
|      world|    1|
|           |    6|
|     hadoop|    1|
+-----------+-----+

-------------------------------------------
Batch: 5
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|       guys|    2|
|apacheapaop|    1|
|      hello|    4|
|     apache|    2|
|      spark|    1|
|      world|    1|
|           |    6|
|     hadoop|    1|
+-----------+-----+

接收终端上的输出(每批)需要15-20秒…如何减少延迟

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题