pyspark: Tricky problem reconciling ID columns across rows in a Spark DataFrame

aiqt4smr posted on 2023-05-16 in Spark

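I have a set of rows in which each event row is uniquely identified by "EventId". A set of events belongs to a group, which is identified by "GUID" and "WFID". The problem is that most events do not carry both IDs on the same row.
An example is below. Only "WF3" has both "GUID" and "WFID", so the IDs need to be reconciled across the other candidate events (WF1 to WF6):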

// Needed for toDF outside spark-shell (spark-shell imports it automatically).
import spark.implicits._

val df = Seq(
  ("GUID1", "",      "WF1", "01-01-2023"),
  ("GUID1", "",      "WF2", "01-02-2023"),
  ("GUID1", "WFID1", "WF3", "01-03-2023"),
  ("GUID1", "",      "WF4", "01-04-2023"),
  ("",      "WFID1", "WF5", "01-05-2023"),
  ("GUID1", "",      "WF6", "01-06-2023"),
  ("GUID2", "",      "WF7", "01-07-2023"),
  ("",      "WFID2", "WF8", "01-08-2023")
).toDF("GUID", "WFID", "EventId", "Time")
df.show

+-----+-----+-------+----------+
| GUID| WFID|EventId|      Time|
+-----+-----+-------+----------+
|GUID1|     |    WF1|01-01-2023|
|GUID1|     |    WF2|01-02-2023|
|GUID1|WFID1|    WF3|01-03-2023|
|GUID1|     |    WF4|01-04-2023|
|     |WFID1|    WF5|01-05-2023|
|GUID1|     |    WF6|01-06-2023|
|GUID2|     |    WF7|01-07-2023|
|     |WFID2|    WF8|01-08-2023|
+-----+-----+-------+----------+

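The requirement is to fill in the GUID and WFID for all candidate events so that every event in a group has the same GUID and WFID. For the example above, the expected output is: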

+-----+-----+-------+----------+
| GUID| WFID|EventId|      Time|
+-----+-----+-------+----------+
|GUID1|WFID1|    WF1|01-01-2023|
|GUID1|WFID1|    WF2|01-02-2023|
|GUID1|WFID1|    WF3|01-03-2023|
|GUID1|WFID1|    WF4|01-04-2023|
|GUID1|WFID1|    WF5|01-05-2023|
|GUID1|WFID1|    WF6|01-06-2023|
|GUID2|     |    WF7|01-07-2023|
|     |WFID2|    WF8|01-08-2023|
+-----+-----+-------+----------+

Any idea how to achieve this in Spark without using a UDF?

Answer #1 (yftpprvb)

Here is a working solution. Let me know if there is a solution that does not need a join!

import org.apache.spark.sql.functions.{col, when}

// Rows that carry both IDs define the GUID <-> WFID mapping.
val dfDistinct = df
  .filter(col("GUID") =!= "" && col("WFID") =!= "")
  .select(col("GUID").as("GUID1"), col("WFID").as("WFID1"))
  .distinct()

// Left join on either ID; when a mapping row matches, take both IDs from it.
df.join(dfDistinct, df("GUID") === dfDistinct("GUID1") || df("WFID") === dfDistinct("WFID1"), "left")
  .withColumn("GUIDnew", when(col("GUID1").isNotNull, col("GUID1")).otherwise(col("GUID")))
  .withColumn("WFIDnew", when(col("WFID1").isNotNull, col("WFID1")).otherwise(col("WFID")))
  .select(col("GUIDnew").as("GUID"), col("WFIDnew").as("WFID"), col("EventId"), col("Time"))
  .show

+-----+-----+-------+----------+
| GUID| WFID|EventId|      Time|
+-----+-----+-------+----------+
|GUID1|WFID1|    WF1|01-01-2023|
|GUID1|WFID1|    WF2|01-02-2023|
|GUID1|WFID1|    WF3|01-03-2023|
|GUID1|WFID1|    WF4|01-04-2023|
|GUID1|WFID1|    WF5|01-05-2023|
|GUID1|WFID1|    WF6|01-06-2023|
|GUID2|     |    WF7|01-07-2023|
|     |WFID2|    WF8|01-08-2023|
+-----+-----+-------+----------+
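
On the question of avoiding the join: one possible sketch, not taken from the answer above, replaces it with two window passes. It assumes that each GUID maps to at most one WFID and vice versa, and that missing IDs are empty strings as in the sample data. The helper `fillFrom` and the names `known` and `reconciled` are made up for illustration. The block builds its own local `SparkSession` so it can run standalone; in spark-shell the existing `spark` can be used instead.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, max, when}

// Local session for a standalone run (assumption: not needed in spark-shell).
val spark = SparkSession.builder().master("local[*]").appName("reconcile-ids").getOrCreate()
import spark.implicits._

val df = Seq(
  ("GUID1", "",      "WF1", "01-01-2023"),
  ("GUID1", "",      "WF2", "01-02-2023"),
  ("GUID1", "WFID1", "WF3", "01-03-2023"),
  ("GUID1", "",      "WF4", "01-04-2023"),
  ("",      "WFID1", "WF5", "01-05-2023"),
  ("GUID1", "",      "WF6", "01-06-2023"),
  ("GUID2", "",      "WF7", "01-07-2023"),
  ("",      "WFID2", "WF8", "01-08-2023")
).toDF("GUID", "WFID", "EventId", "Time")

// Hypothetical helper: fill an empty `target` with the non-empty value that
// other rows sharing the same non-empty `key` carry, if there is one.
def fillFrom(target: String, key: String): Column = {
  // when(...) without otherwise yields null for empty values, which max() ignores.
  val known = max(when(col(target) =!= "", col(target))).over(Window.partitionBy(key))
  when(col(target) === "" && col(key) =!= "", coalesce(known, col(target)))
    .otherwise(col(target))
}

val reconciled = df
  .withColumn("WFID", fillFrom("WFID", "GUID")) // pass 1: propagate WFID within each GUID
  .withColumn("GUID", fillFrom("GUID", "WFID")) // pass 2: propagate GUID within each filled WFID

reconciled.orderBy("EventId").show()
```

Each window still triggers a shuffle, so this is not necessarily cheaper than the join. All rows with an empty key also land in a single partition, which may skew on large data, and two passes will not resolve longer chains of partially overlapping IDs.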
