Aggregation and week of the year in a PySpark DataFrame

Posted by ktca8awb on 2021-05-17 in Spark

I have a DataFrame with the following schema:

root
 |-- device_id: string (nullable = true)
 |-- eventName: string (nullable = true)
 |-- client_event_time: timestamp (nullable = true)
 |-- eventDate: date (nullable = true)
 |-- deviceType: string (nullable = true)

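I want to add the following two columns to this DataFrame:
wau: weekly active users (distinct device IDs, grouped by week)
week: the week of the year (using the appropriate SQL function)
I want to use an approximate distinct count, with the optional keyword rsd set to .01.
I started with something like the code below, but it raised an error.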

spark.readStream
.format("delta")
.load(inputpath)
.groupBy(weekofyear('eventDate'))
.count()
.distinct()
.writeStream
.format("delta")
.option("checkpointLocation", outputpath)
.outputMode("complete")
.start(outputpath)

apache-spark pyspark apache-spark-sql delta-lake spark-streaming

Source: https://stackoverflow.com/questions/64820524/aggregation-and-week-of-the-year-in-pyspark-dataframe

1 Answer

js5cn81o1#

Based on the discussion, the code below works.

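from pyspark.sql import functions as F

# Wrapping the chain in parentheses lets it span multiple lines in Python.
(spark.readStream
  .format("delta")
  .load(inputdata)
  # Week of the year derived from eventDate.
  .groupBy(F.weekofyear('eventDate').alias('week'))
  # Approximate distinct devices per week; the alias goes on the column, not the DataFrame.
  .agg(F.approx_count_distinct('device_id', rsd=.01).alias('WAU'))
  .writeStream
  .format("delta")
  .option("checkpointLocation", outputdata)
  .outputMode("complete")
  .start(outputdata))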
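
To sanity-check the weekly numbers on a small sample, the exact distinct count per week can be computed in plain Python. This is only a minimal sketch: the function name `weekly_active_users` and the sample events are hypothetical, and it assumes the events are available as `(device_id, eventDate)` pairs. It relies on `date.isocalendar()`, which uses ISO 8601 week numbering, the same convention documented for Spark's `weekofyear`.

```python
from collections import defaultdict
from datetime import date


def weekly_active_users(events):
    """Return {iso_week: exact number of distinct device_ids} for (device_id, date) pairs."""
    devices_by_week = defaultdict(set)
    for device_id, event_date in events:
        # isocalendar() yields (ISO year, ISO week, ISO weekday); index 1 is the week.
        week = event_date.isocalendar()[1]
        devices_by_week[week].add(device_id)
    return {week: len(devices) for week, devices in sorted(devices_by_week.items())}


# Hypothetical sample data, not taken from the original question.
sample_events = [
    ("a", date(2021, 5, 10)),  # Monday, ISO week 19
    ("a", date(2021, 5, 12)),  # same device, same week: counted once
    ("b", date(2021, 5, 16)),  # Sunday, still ISO week 19
    ("a", date(2021, 5, 17)),  # Monday, ISO week 20
]

wau = weekly_active_users(sample_events)
print(wau)
```

With `rsd` set to .01, the approximate counts should land close to these exact values. Note that a bare week number merges same-numbered weeks from different years, so data spanning more than one year would likely need the year as an additional grouping key.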

2021-05-17
