How can I stream data from a Google PubSub topic into PySpark (on Google Cloud)?

Posted by kgqe7b3p on 2022-12-17 in Spark


I have data streaming into a topic in Google PubSub, and I can view that data with simple Python code:

...
def callback(message):
    # message.data is bytes in Python 3, so decode it before concatenating with str
    print(datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f") + ": message = '" + message.data.decode("utf-8") + "'")
    message.ack()

future = subscriber.subscribe(subscription_name, callback)
future.result()

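The Python code above receives data from the Google PubSub topic (through the subscription subscription_name) and writes it to the terminal, as expected. I would like to stream the same data from the topic into PySpark (as an RDD or a DataFrame) so that I can apply further streaming transformations in PySpark, such as windowing and aggregations, as described here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
That guide documents how to read from other streaming sources (for example Kafka), but not from Google PubSub. Is there a way to stream from Google PubSub into PySpark?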

pyspark

Source: https://stackoverflow.com/questions/52375509/how-can-i-stream-data-from-a-google-pubsub-topic-into-pyspark-on-google-cloud

3 Answers


muk1a3rh1#

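You can use **Apache Bahir**, which provides extensions for Apache Spark, including a connector for Google Cloud Pub/Sub.
There is an example from Google Cloud Platform that uses Spark on Kubernetes to compute word counts over a data stream received from a Google Cloud PubSub topic and writes the results to a Google Cloud Storage (GCS) bucket.
Another example uses **DStream** to deploy an Apache Spark streaming application on Cloud Dataproc and process messages from Cloud Pub/Sub.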


dldeef672#

You can use Apache Beam: https://beam.apache.org/
Apache Beam has Python support for Cloud Pub/Sub: https://beam.apache.org/documentation/io/built-in/
There is a Python SDK: https://beam.apache.org/documentation/sdks/python/
And there is support for Spark as a runner: https://beam.apache.org/documentation/runners/capability-matrix/
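
To make the Beam route concrete, here is a minimal sketch, assuming `apache-beam[gcp]` is installed. The helper names (`subscription_path`, `run`), the project and subscription IDs, and the 60-second window are all made up for illustration. It reads from a subscription, applies fixed windows, and counts identical messages per window. Whether the Python Pub/Sub source works on the Spark runner depends on your Beam version, so check the capability matrix before relying on it; the sketch leaves the runner to the pipeline arguments.

```python
def subscription_path(project_id, subscription_id):
    """Build the fully qualified subscription name that Pub/Sub expects."""
    return f"projects/{project_id}/subscriptions/{subscription_id}"


def run(subscription, pipeline_args=None):
    """Count identical messages per 60-second window (illustrative only)."""
    # Imported here so the helper above stays usable without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window

    # streaming=True is required for an unbounded source such as Pub/Sub.
    options = PipelineOptions(pipeline_args, streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(subscription=subscription)
            | "Decode" >> beam.Map(lambda payload: payload.decode("utf-8"))
            | "Window" >> beam.WindowInto(window.FixedWindows(60))
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

Calling `run(subscription_path("my-project", "my-subscription"), ["--runner=DirectRunner"])` would try this locally; both IDs are placeholders.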


x33g5p2x3#

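I believe you can use this: https://cloud.google.com/pubsub/lite/docs/samples/pubsublite-spark-streaming-from-pubsublite
You create a subscription and pass it as an option to the Spark stream. Note that this sample targets Pub/Sub Lite.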
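
A minimal sketch of that idea follows, assuming the Pub/Sub Lite Spark connector jar is on the Spark classpath and that an existing `SparkSession` is passed in. The function names and the project number, location, and subscription ID are hypothetical. The windowed count also assumes the connector exposes a `publish_timestamp` column, so verify that against the connector's schema before using it.

```python
def pubsublite_subscription_path(project_number, location, subscription_id):
    """Build the Pub/Sub Lite subscription path used as the source option."""
    return (
        f"projects/{project_number}/locations/{location}"
        f"/subscriptions/{subscription_id}"
    )


def read_stream(spark, subscription):
    """Return a streaming DataFrame backed by a Pub/Sub Lite subscription."""
    return (
        spark.readStream.format("pubsublite")
        .option("pubsublite.subscription", subscription)
        .load()
    )


def windowed_counts(sdf, window_duration="1 minute"):
    """Count messages per time window (assumes a publish_timestamp column)."""
    # Imported here so the helpers above stay usable without PySpark installed.
    from pyspark.sql.functions import col, window

    return (
        sdf.withWatermark("publish_timestamp", "2 minutes")
        .groupBy(window(col("publish_timestamp"), window_duration))
        .count()
    )
```

The result of `windowed_counts(read_stream(spark, path))` could then be started with something like `.writeStream.outputMode("update").format("console").start()`.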
