Pyspark -对具有相同ID的表的行求和

vpfxa7rd  于 2022-11-21  发布在  Spark
关注(0)|答案(2)|浏览(171)

我正在尝试合并两个表,以便将具有相同ID(该ID可能是多个列的组合)的行加总在一起。仍然保留与任一表都不匹配的行。
下面是我所想的一个例子。C1C2是ID列,而W1W2是要求和的行。
DF A:
| C1级|C2级|第一周|第二周|
| - -|- -|- -|- -|
| A级|A级|五个|三个|
| A级|B|六个|七个|
| B| A级|2个|三个|
| B| B|八个|八个|
DFA
数据块B:
| C1级|C2级|第一周|第二周|
| - -|- -|- -|- -|
| A级|A级|2个|七个|
| A级|C语言|四个|八个|
| B| A级|九个|一个|
| B| D级|三个|五个|
DFB
结果 Dataframe
| C1级|C2级|第一周|第二周|
| - -|- -|- -|- -|
| A级|A级|七个|10个|
| A级|B|六个|七个|
| A级|C语言|四个|八个|
| B| A级|十一|四个|
| B| B|八个|八个|
| B| D级|三个|五个|
Resulting DF
我有两个想法,但目前还不能实现。第一个是在两个 Dataframe 上做一个联合,然后做一个groupby聚合,但我不确定这是否是最有效的方法。
另一个是在两个 Dataframe 上进行外部连接,然后对公共列求和。我在实现这一点时遇到了麻烦,因为ID列来自一个列表,并且是不时变化的变量,因此连接不能硬编码。

ldioqlga

ldioqlga1#

full_join将更容易实现,即使您列列表会随时间而变化。
下面是一个使用您的示例的示例

id_cols = ['c1', 'c2']
sum_cols = ['w1', 'w2']

data1_sdf.alias('a'). \
    join(data2_sdf.alias('b'), id_cols, 'full'). \
    selectExpr(*id_cols,
               *['(coalesce(a.{0}, 0) + coalesce(b.{0}, 0)) as {0}'.format(c) for c in sum_cols]
               ). \
    show()

# +---+---+---+---+
# | c1| c2| w1| w2|
# +---+---+---+---+
# |  A|  A|  7| 10|
# |  A|  C|  4|  8|
# |  B|  B|  8|  8|
# |  B|  A| 11|  4|
# |  B|  D|  3|  5|
# |  A|  B|  6|  7|
# +---+---+---+---+
t40tm48m

t40tm48m2#

将两个 Dataframe 连接为“full”,并对非空的列求和:

dfa = spark.createDataFrame(data=[["A","A","5","3"],["A","B","6","7"],["B","A","2","3"],["B","B","8","8"]], schema=["C1","C2","W1","W2"])

dfb = spark.createDataFrame(data=[["A","A","2","7"],["A","C","4","8"],["B","A","9","1"],["B","D","3","5"]], schema=["C1","C2","W1","W2"])

result_df = dfa \
  .join(dfb, (dfa.C1 == dfb.C1) & (dfa.C2 == dfb.C2), how="full") \
  .withColumn("C1_new", F.coalesce(dfa.C1, dfb.C1)) \
  .withColumn("C2_new", F.coalesce(dfa.C2, dfb.C2)) \
  .withColumn("W1_sum", F.when(dfa.W1.isNull(), dfb.W1).when(dfb.W1.isNull(), dfa.W1).otherwise((dfa.W1 + dfb.W1).cast('int'))) \
  .withColumn("W2_sum", F.when(dfa.W2.isNull(), dfb.W2).when(dfb.W2.isNull(), dfa.W2).otherwise((dfa.W2 + dfb.W2).cast('int'))) \
  .drop("C1", "C2", "W1", "W2")

[Out]:
+------+------+------+------+
|C1_new|C2_new|W1_sum|W2_sum|
+------+------+------+------+
|     A|     A|     7|    10|
|     A|     B|     6|     7|
|     A|     C|     4|     8|
|     B|     A|    11|     4|
|     B|     B|     8|     8|
|     B|     D|     3|     5|
+------+------+------+------+

相关问题