Merging rows that have the same credentials - pyspark dataframe

Posted by ulydmbyx on 2021-07-09 in Spark


How can I merge two rows of a PySpark DataFrame that satisfy a condition?
Example:
DataFrame

+---+---+------+                                                                
|src|dst|weight|
+---+---+------+
|  8|  7|     1|
|  1|  1|    93|
|  1|  4|     1|
|  4|  4|     2|
|  4|  1|     3|
|  1|  7|     1|
+---+---+------+

Condition: (df.src, df.dst) == (df.dst, df.src)
Expected output:
Sum the weights and drop (4,1)

+---+---+------+                                                                
|src|dst|weight|
+---+---+------+
|  8|  7|     1|
|  1|  1|    93|
|  1|  4|     4| #
|  4|  4|     2|
|  1|  7|     1|
+---+---+------+

or
Sum the weights and drop (1,4)

+---+---+------+                                                                
|src|dst|weight|
+---+---+------+
|  8|  7|     1|
|  1|  1|    93|
|  4|  4|     2|
|  4|  1|     4| #
|  1|  7|     1|
+---+---+------+

python apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/66871789/merging-rows-that-have-same-credentials-pyspark-dataframe

1 answer


xu3bshqb1#

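You can add a src_dst column holding the sorted array of src and dst, then compute the sum of weight for each src_dst, drop the duplicate rows per src_dst, and finally drop the src_dst column: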

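from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    # Order-independent key: (1, 4) and (4, 1) both become [1, 4]
    'src_dst', 
    F.sort_array(F.array('src', 'dst'))
).withColumn(
    # Replace weight with the total weight of all rows sharing the key
    'weight', 
    F.sum('weight').over(Window.partitionBy('src_dst'))
).dropDuplicates(['src_dst']).drop('src_dst')  # keep one row per key, then drop the helper column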

df2.show()
+---+---+------+
|src|dst|weight|
+---+---+------+
|  1|  7|     1|
|  1|  1|    93|
|  1|  4|     4|
|  8|  7|     1|
|  4|  4|     2|
+---+---+------+
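
Note that dropDuplicates does not promise which row of each src_dst group survives, so the merged row may come back as (1,4) or (4,1); the question accepts either. To make the idea behind the sorted key concrete, here is a minimal plain-Python sketch (no Spark involved) run on the question's data. The function name merge_undirected and the choice to keep the first-seen orientation of each pair are assumptions made for illustration, not part of the answer above.

```python
# Rows from the question as (src, dst, weight) tuples.
rows = [(8, 7, 1), (1, 1, 93), (1, 4, 1), (4, 4, 2), (4, 1, 3), (1, 7, 1)]


def merge_undirected(rows):
    """Sum the weights of rows whose (src, dst) pairs match regardless of order.

    Hypothetical helper: the first-seen orientation of each pair is kept.
    """
    totals = {}  # sorted (src, dst) key -> [src, dst, summed weight]
    for src, dst, weight in rows:
        key = tuple(sorted((src, dst)))  # plays the role of sort_array(array(src, dst))
        if key in totals:
            totals[key][2] += weight  # same undirected pair: accumulate the weight
        else:
            totals[key] = [src, dst, weight]  # first occurrence fixes the orientation
    return [tuple(entry) for entry in totals.values()]


merged = merge_undirected(rows)
for row in merged:
    print(row)
```

On this input the merged pair is reported as (1, 4) with weight 4, matching the first expected output in the question. In Spark, a comparable order-independent key could probably also be built with F.least('src', 'dst') and F.greatest('src', 'dst') and aggregated with groupBy, although that would rewrite (8,7) as (7,8).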

Answered 2021-07-09
