Scala: finding the maximum across multiple columns and merging the result into a single column in Spark

Posted by sf6xfgos on 2021-05-16 in Spark

I have the following dataset:

+----+-----+--------+-----+--------+
|  id|date1|address1|date2|address2|
+----+-----+--------+-----+--------+
|   1| 2019|   Paris| 2018|  Madrid|
|   2| 2020|New York| 2002|  Geneva|
|   3| 1998|  London| 2005|   Tokyo|
|   4| 2005|  Sydney| 2013|  Berlin|
+----+-----+--------+-----+--------+

For each id, I am trying to get the most recent date and its corresponding address in two additional columns. The expected result is:

+----+-----+--------+-----+--------+--------+-----------+
|  id|date1|address1|date2|address2|date_max|address_max|
+----+-----+--------+-----+--------+--------+-----------+
|   1| 2019|   Paris| 2018|  Madrid|    2019|      Paris|
|   2| 2020|New York| 2002|  Geneva|    2020|   New York|
|   3| 1998|  London| 2005|   Tokyo|    2005|      Tokyo|
|   4| 2005|  Sydney| 2013|  Berlin|    2013|     Berlin|
+----+-----+--------+-----+--------+--------+-----------+

Is there a way to do this efficiently?

izj3ouym #1

You can use a CASE WHEN expression to select the most recent date and its address:

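import org.apache.spark.sql.functions._

// Pick the later of the two dates, and the address that belongs to it.
val date_max = when(col("date1") > col("date2"), col("date1")).otherwise(col("date2")).alias("date_max")
val address_max = when(col("date1") > col("date2"), col("address1")).otherwise(col("address2")).alias("address_max")

// select takes either all Strings or all Columns, so "*" is wrapped in col(...).
// The result is bound to a new val because df itself cannot be reassigned.
val result = df.select(col("*"), date_max, address_max)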
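
For anyone who wants to try this outside spark-shell, here is a minimal self-contained sketch of the same CASE WHEN idea. It assumes a local SparkSession and rebuilds the sample data from the question; the names `spark`, `firstIsNewer`, and `result` are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("max-date-case-when")
  .getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val df = Seq(
  (1, 2019, "Paris", 2018, "Madrid"),
  (2, 2020, "New York", 2002, "Geneva"),
  (3, 1998, "London", 2005, "Tokyo"),
  (4, 2005, "Sydney", 2013, "Berlin")
).toDF("id", "date1", "address1", "date2", "address2")

// Build the comparison once and reuse it for both output columns.
val firstIsNewer = col("date1") > col("date2")

val result = df
  .withColumn("date_max", when(firstIsNewer, col("date1")).otherwise(col("date2")))
  .withColumn("address_max", when(firstIsNewer, col("address1")).otherwise(col("address2")))

result.show()
```

Note that when the two dates are equal, or when either date is null, the condition is not true and the `otherwise` branch is taken, so `date2`/`address2` are returned. Whether that is the desired tie-breaking depends on your data.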

If you want an option that scales to many columns:

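// Assumes import org.apache.spark.sql.functions._ from the snippet above.
val df2 = df.withColumn(
    "all_date",
    // Collect every date column into one array, in column order.
    array(df.columns.filter(_.contains("date")).map(col): _*)
).withColumn(
    "all_address",
    // Address columns must appear in the same order as their date columns.
    array(df.columns.filter(_.contains("address")).map(col): _*)
).withColumn(
    "date_max",
    array_max(col("all_date"))
).withColumn(
    "address_max",
    // array_position and element_at are both 1-based, so the index carries over directly.
    // array_position returns a long, hence the cast to int.
    element_at(col("all_address"),
        array_position(col("all_date"), array_max(col("all_date"))).cast("int")
    )
).drop("all_date", "all_address")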

df2.show
+---+-----+--------+-----+--------+--------+-----------+
| id|date1|address1|date2|address2|date_max|address_max|
+---+-----+--------+-----+--------+--------+-----------+
|  1| 2019|   Paris| 2018|  Madrid|    2019|      Paris|
|  2| 2020|New York| 2002|  Geneva|    2020|   New York|
|  3| 1998|  London| 2005|   Tokyo|    2005|      Tokyo|
|  4| 2005|  Sydney| 2013|  Berlin|    2013|     Berlin|
+---+-----+--------+-----+--------+--------+-----------+
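
As a possible variation (not from the answer above), each date can be packed together with its address into a struct, and `greatest` applied to the structs. Spark orders structs field by field, so putting the date first makes it drive the comparison. The sketch below assumes the columns follow the `dateN`/`addressN` naming pattern from the question, with at least two such pairs; `suffixes`, `pairs`, and the temporary `latest` column are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, greatest, struct}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("max-date-struct")
  .getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val df = Seq(
  (1, 2019, "Paris", 2018, "Madrid"),
  (2, 2020, "New York", 2002, "Geneva"),
  (3, 1998, "London", 2005, "Tokyo"),
  (4, 2005, "Sydney", 2013, "Berlin")
).toDF("id", "date1", "address1", "date2", "address2")

// Assumption: every "dateN" column has a matching "addressN" column.
val suffixes = df.columns.filter(_.startsWith("date")).map(_.stripPrefix("date"))

// One struct per (date, address) pair; the date is the first field,
// so it is compared before the address.
val pairs = suffixes.map { s =>
  struct(col(s"date$s").as("date"), col(s"address$s").as("address"))
}

val result = df
  .withColumn("latest", greatest(pairs: _*)) // greatest needs at least two columns
  .withColumn("date_max", col("latest.date"))
  .withColumn("address_max", col("latest.address"))
  .drop("latest")

result.show()
```

Because each date travels with its own address, this version does not rely on the date and address columns being listed in matching order. On equal dates the comparison falls through to the address field, so tie-breaking differs from the array-based version.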
