Scala: replace selected column values in a large Spark DataFrame when the ID matches

vcudknz3 posted on 2023-06-23 in Scala

I have two DataFrames whose schemas differ by one column. The DataFrames below are just an example.
DataFrame 1:

ID Name   Age  Year   Area
1  Alice  20    95     X
2  Bob    30    96     Y
3  Jack   10    98     Z
4  Will   25    99     A

DataFrame 2:

ID  Name   Age  Year  Country
10  Alice  20   95    US
20  Bob    30   96    UK
3   Jack   30   24    DE
4   Will   40   25    ES

I need Scala code that updates the "Age" and "Year" columns of DataFrame 2 whenever a matching ID is found in DataFrame 1. The desired result is shown below.
Desired DataFrame 2:

ID  Name   Age  Year  Country
10  Alice  20   95    US
20  Bob    30   96    UK
3   Jack   10   98    DE
4   Will   25   99    ES

The real DataFrames are huge, so I want a Scala snippet that is feasible for big-data computation. Can this be done with a simple join?

qyswt5oh (answer #1)

You can left join df2 with df1 and use the coalesce function to take Age and Year from df1 when they are not null, and from df2 otherwise.
Input:

import spark.implicits._
import org.apache.spark.sql.functions.coalesce

val data1 = Seq(
        (1, "Alice", 20, 95, "X"),
        (2, "Bob", 30, 96, "Y"),
        (3, "Jack", 10, 98, "Z"),
        (4, "Will", 25, 99, "A")
      )
val df1 = spark.sparkContext.parallelize(data1).toDF("ID", "Name", "Age", "Year", "Area")

val data2 = Seq(
        (10, "Alice", 20, 95, "US"),
        (20, "Bob", 30, 96, "UK"),
        (3, "Jack", 30, 24, "DE"),
        (4, "Will", 40, 25, "ES")
      )
val df2 = spark.sparkContext.parallelize(data2).toDF("ID", "Name", "Age", "Year", "Country")

Join the two DataFrames:

df2.join(df1.select("ID", "Age", "Year"), "ID", "left_outer")
    .select(df2("ID"), 
            df2("Name"), 
            coalesce(df1("Age"), df2("Age")).as("Age"), 
            coalesce(df1("Year"), df2("Year")).as("Year"), 
            df2("Country")) 
    .show()

Output:

+---+-----+---+----+-------+
| ID| Name|Age|Year|Country|
+---+-----+---+----+-------+
| 10|Alice| 20|  95|     US|
| 20|  Bob| 30|  96|     UK|
|  3| Jack| 10|  98|     DE|
|  4| Will| 25|  99|     ES|
+---+-----+---+----+-------+
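
The same left join plus coalesce idea can also be sketched as a reusable function. This is only an illustration under some assumptions: the names `ReplaceMatchedColumns`, `replaceMatched`, `newAge` and `newYear` are made up here, and the session runs in local mode. Renaming the df1 columns before the join means the later expressions never have to distinguish two columns called Age or Year.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

// Hypothetical helper object; the names are illustrative only.
object ReplaceMatchedColumns {

  // Returns `target` with Age and Year overwritten by the values from
  // `source` wherever the ID matches; unmatched rows keep their own values.
  def replaceMatched(target: DataFrame, source: DataFrame): DataFrame = {
    // Keep only the columns needed from the lookup side and rename them
    // (newAge / newYear are assumed temporary names) to avoid ambiguity.
    val updates = source.select(
      col("ID"),
      col("Age").as("newAge"),
      col("Year").as("newYear"))

    target
      .join(updates, Seq("ID"), "left_outer")
      // coalesce returns the first non-null argument, so rows without a
      // match (newAge / newYear are null) fall back to the original value.
      .withColumn("Age", coalesce(col("newAge"), col("Age")))
      .withColumn("Year", coalesce(col("newYear"), col("Year")))
      .drop("newAge", "newYear")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a real job would get its own config.
    val spark = SparkSession.builder()
      .appName("replace-matched-columns")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      (1, "Alice", 20, 95, "X"),
      (2, "Bob", 30, 96, "Y"),
      (3, "Jack", 10, 98, "Z"),
      (4, "Will", 25, 99, "A")
    ).toDF("ID", "Name", "Age", "Year", "Area")

    val df2 = Seq(
      (10, "Alice", 20, 95, "US"),
      (20, "Bob", 30, 96, "UK"),
      (3, "Jack", 30, 24, "DE"),
      (4, "Will", 40, 25, "ES")
    ).toDF("ID", "Name", "Age", "Year", "Country")

    // Sort only to make the printed rows deterministic; a join by itself
    // does not guarantee any row order.
    replaceMatched(df2, df1).orderBy("ID").show()

    spark.stop()
  }
}
```

Two caveats apply to this sketch. If df1 contains the same ID more than once, the join produces one output row per match, so df1 may need deduplicating first. And if a matched df1 row has a null Age or Year, coalesce keeps the df2 value instead of overwriting it with null. When the projected df1 is small enough to fit in executor memory, wrapping `updates` in `org.apache.spark.sql.functions.broadcast` is a hint that may avoid shuffling the large side.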
