Trying to create a column with the maximum timestamp in a PySpark DataFrame

ffx8fchx posted on 2021-05-18 in Spark


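I am really new to PySpark. I just want to find the maximum value of the "date" column and add a new column to the DataFrame that holds this maximum date, repeated for every row, so that: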

A      B                                        C
a  timestamp1                              timestamp3
b  timestamp2    -------------------->     timestamp3
c  timestamp3                              timestamp3

I am using the following line of code:

df.withColumn('dummy_column',f.lit((f.max('date'))).cast('timestamp')).show(9)

But I get an error:

> AnalysisException: grouping expressions sequence is empty, and
> '`part`' is not an aggregate function. Wrap '(CAST(max(`date`) AS
> TIMESTAMP) AS `new_column`)' in windowing function(s) or wrap '`part`'
> in first() (or first_value) if you don't care which value you get.;;

Can someone help me understand why this error occurs and how to fix it?

apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/64709879/trying-to-create-a-column-with-the-maximum-timestamp-in-pyspark-dataframe

2 Answers


mm9b1k5b1#

Another option is to compute max_timestamp by calling groupBy() on the existing DataFrame and using max() to get the maximum timestamp, then store it in a variable and use it as required.

Create the df here

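import pyspark.sql.functions as F

df = spark.createDataFrame([(1,"2020-10-13"),(2,"2020-10-14"),(3,"2020-10-15")],[ "id","ts"])
df.show()

# Aggregate over the whole DataFrame (groupBy with no columns) to get a single row
df_max = df.groupBy().agg(F.max("ts").alias("max_ts"))

# Taking the maximum into a variable for future use
df_max_var = df_max.collect()[0]['max_ts']

# Attach the maximum to every row as a literal column
df = df.withColumn("dummy_col", F.lit(df_max_var))
df.show()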

Input

+---+----------+
| id|        ts|
+---+----------+
|  1|2020-10-13|
|  2|2020-10-14|
|  3|2020-10-15|
+---+----------+

Output

+---+----------+----------+
| id|        ts| dummy_col|
+---+----------+----------+
|  1|2020-10-13|2020-10-15|
|  2|2020-10-14|2020-10-15|
|  3|2020-10-15|2020-10-15|
+---+----------+----------+

Posted on 2021-05-19

kupeojn62#

You are probably looking for:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('dummy_column',f.max('date').over(w).cast('timestamp')).show(9)

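Aggregate functions such as max work together with a window or a grouping operation. They cannot work on their own, because you have not specified the range of rows the aggregate function should operate on.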

Posted on 2021-05-18
