I have a DataFrame:
+--------------+-------------+-------------------+
|          ecid|creation_user| creation_timestamp|
+--------------+-------------+-------------------+
|ECID-195000300|     USER_ID1|2018-08-31 20:00:00|
|ECID-195000300|     USER_ID2|2016-08-31 20:00:00|
+--------------+-------------+-------------------+
I need the single row with the earliest timestamp:
+--------------+-------------+-------------------+
|          ecid|creation_user| creation_timestamp|
+--------------+-------------+-------------------+
|ECID-195000300|     USER_ID2|2016-08-31 20:00:00|
+--------------+-------------+-------------------+
How can I do this in PySpark? I tried:
df.groupBy("ecid").agg(min("creation_timestamp"))
However, that only returns the ecid and timestamp columns. I want all the columns, not just those two.
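One way to keep all the columns while still using groupBy is to join the aggregate back onto the original DataFrame. A minimal sketch, assuming the df shown above (the name earliest is just illustrative):

```
from pyspark.sql.functions import min as min_

# Earliest timestamp per ecid, aliased so the join keys match.
earliest = df.groupBy("ecid").agg(min_("creation_timestamp").alias("creation_timestamp"))

# Joining back on both columns recovers creation_user and any other fields.
earliest.join(df, on=["ecid", "creation_timestamp"]).show()
```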
2 Answers
Answer 1:
Use the row_number window function, partitioned by ecid and ordered by creation_timestamp. Example:
```
from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

df = spark.createDataFrame(
    [("ECID-195000300", "USER_ID1", "2018-08-31 20:00:00"),
     ("ECID-195000300", "USER_ID2", "2016-08-31 20:00:00")],
    ["ecid", "creation_user", "creation_timestamp"],
)

# Number the rows within each ecid, earliest timestamp first,
# then keep only the first row of each group.
w = Window.partitionBy("ecid").orderBy("creation_timestamp")
df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn").show()
```
+--------------+-------------+-------------------+
| ecid|creation_user| creation_timestamp|
+--------------+-------------+-------------------+
|ECID-195000300| USER_ID2|2016-08-31 20:00:00|
+--------------+-------------+-------------------+
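Note that row_number keeps exactly one row per ecid even if two rows tie on the earliest timestamp; swap in rank if you want to keep all tied rows.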
Answer 2:
I think you need a window function plus a filter. I can propose the following untested solution: a window function lets you compute the min of creation_timestamp within each ecid as a new column of your DataFrame.
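A minimal sketch of that idea, assuming the df from the first answer (min_ts is just an illustrative column name, and the code is untested in the same spirit as the answer):

```
from pyspark.sql import Window
from pyspark.sql.functions import col, min as min_

# Attach the per-ecid minimum timestamp to every row, then keep
# the rows whose own timestamp equals that minimum.
w = Window.partitionBy("ecid")
(df.withColumn("min_ts", min_("creation_timestamp").over(w))
   .filter(col("creation_timestamp") == col("min_ts"))
   .drop("min_ts")
   .show())
```

Unlike the row_number approach, this keeps every row that ties for the earliest timestamp.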