I have a DataFrame:
+----------+------------+------------+--------------------+
| acc |id_Vehicule |id_Device |dateTracking |
+----------+------------+------------+--------------------+
| 1 | 1 | 2 |2020-02-12 14:50:00 |
| 0 | 1 | 2 |2020-02-12 14:59:00 |
| 0 | 2 | 3 |2020-02-12 15:10:00 |
| 1 | 2 | 3 |2020-02-12 15:20:00 |
+----------+------------+------------+--------------------+
I would like to get this output:
+----------+------------+------------+--------------------+----------------+
| acc |id_Vehicule |id_Device |dateTracking | acc_previous |
+----------+------------+------------+--------------------+----------------+
| 1 | 1 | 2 |2020-02-12 14:50:00 | null |
| 0 | 1 | 2 |2020-02-12 14:59:00 | 1 |
| 0 | 2 | 3 |2020-02-12 15:10:00 | null |
| 1 | 2 | 3 |2020-02-12 15:20:00 | 0 |
+----------+------------+------------+--------------------+----------------+
I tried the following code:
WindowSpec w = org.apache.spark.sql.expressions.Window
        .partitionBy("id_Vehicule", "id_Device", "dateTracking")
        .orderBy("dateTracking");
Dataset<Row> df = df1.withColumn("acc_previous", lag("acc", 1).over(w));
df.show();
But I got this result:
+----------+------------+------------+--------------------+----------------+
| acc |id_Vehicule |id_Device |dateTracking | acc_previous |
+----------+------------+------------+--------------------+----------------+
| 1 | 1 | 2 |2020-02-12 14:50:00 | null |
| 0 | 1 | 2 |2020-02-12 14:59:00 | null |
| 0 | 2 | 3 |2020-02-12 15:10:00 | null |
| 1 | 2 | 3 |2020-02-12 15:20:00 | null |
+----------+------------+------------+--------------------+----------------+
I would really appreciate any ideas.
1 Answer
I found the solution; maybe it will help someone else. The problem came from the "dateTracking" column: it should not be a partitioning column, so I removed it from partitionBy. Because dateTracking is essentially unique for every row, partitioning by it puts each row into its own one-row partition, so lag() never finds a previous row and always returns null.
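For reference, a minimal sketch of the corrected code, assuming the same df1 and the column names shown above:

import static org.apache.spark.sql.functions.lag;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Partition only by the entity columns; dateTracking is used for ordering alone,
// so consecutive rows of the same vehicle/device stay in one partition.
WindowSpec w = Window.partitionBy("id_Vehicule", "id_Device")
                     .orderBy("dateTracking");

// lag("acc", 1) returns the previous row's acc within the partition,
// or null for the partition's first row.
Dataset<Row> df = df1.withColumn("acc_previous", lag("acc", 1).over(w));
df.show();

With this window, the first row of each (id_Vehicule, id_Device) group gets null and every later row gets the preceding acc, which matches the desired output above.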