I am trying to use pyspark to compute the session duration for each userid from event data. A sample of the data looks like this:
diff_session.show(8,False):
|userid|platform |previousTime |currentTime |timeDifference |
|1234 |13 |null |2017-07-20 10:49:30.027|null |
|1234 |13 |null |2017-07-20 10:04:23.1 |null |
|1234 |13 |2017-07-20 10:04:23.1 |2017-07-20 10:06:23.897|120 |
|1234 |13 |2017-07-20 10:04:23.897|2017-07-20 10:40:29.472|2166 |
|1234 |13 |2017-07-20 10:40:29.472|2017-07-20 10:40:50.347|11 |
|1234 |13 |2017-07-20 10:40:30.347|2017-07-20 10:51:16.458|646 |
|1234 |13 |2017-07-20 10:51:16.458|2017-07-20 10:51:17.427|1 |
I want to group by userid and platform.
Then, within each group, I want to set currentTime equal to previousTime whenever timeDifference > 2000 or timeDifference is null. I tried the following:
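from pyspark.sql import SQLContext, functions
# when(condition, value) takes the value as its second argument; there is no THEN keyword
df_session.select(df_session.userid, df_session.platform, functions.when(df_session.timeDifference > 2000, df_session.previousTime).otherwise(df_session.currentTime))
# null checks on a column use Column.isNull(), not "is null"
df_session.select(df_session.userid, df_session.platform, functions.when(df_session.timeDifference.isNull(), df_session.currentTime).otherwise(df_session.previousTime))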
Next, I want to sum all the time differences that are below 2000 and add that total (totaltimedifference) to currentTime. The expected result is:
|userid|platform |previousTime |currentTime |timeDifference |
|1234 |13 |2017-07-20 10:49:30.027|2017-07-20 10:49:30.027|0 |
|1234 |13 |2017-07-20 10:04:23.1 |2017-07-20 10:04:23.1 |0 |
|1234 |13 |2017-07-20 10:04:23.1 |2017-07-20 10:06:23.897|120 |
|1234 |13 |2017-07-20 10:04:23.897|2017-07-20 10:04:23.897|0 |
|1234 |13 |2017-07-20 10:40:29.472|2017-07-20 10:51:17.427|658 |
The last part is the tricky one, and I don't know where to start yet. Thank you.
1 Answer
Hope this helps!
Don't forget to let us know whether it solved your problem :)