Group by and aggregate distinct values into a string

Posted by tcomlyy6 on 2021-07-13 in Spark

I have a table like this:

ID   start date     name        type
 1   2020/01/01   cheese,meat    A, B
 1   2020/01/01   cheese,fruit   A, C

The desired output is:

ID    start date    name                  type
1     2020/01/01   cheese,meat,fruit      A,B,C

I tried collect_list and collect_set, but neither gives this result.

but5z9lq  #1

You can split and explode the columns, then group by and collect the values as sets:

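import pyspark.sql.functions as F

# Split on a comma plus optional whitespace so that 'A, B' yields 'A' and 'B',
# explode each column into one row per value, then collect the distinct values.
df2 = df.withColumn(
    'name',
    F.explode(F.split('name', r',\s*'))
).withColumn(
    'type',
    F.explode(F.split('type', r',\s*'))
).groupBy(
    'ID', 'start date'
).agg(
    F.concat_ws(',', F.collect_set('name')).alias('name'),
    F.concat_ws(',', F.collect_set('type')).alias('type')
)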

df2.show()
+---+----------+-----------------+-----+
| ID|start date|             name| type|
+---+----------+-----------------+-----+
|  1|2020/01/01|fruit,meat,cheese|C,B,A|
+---+----------+-----------------+-----+
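
The split step is what makes collect_set work here: without it, collect_set deduplicates whole cell values, so 'cheese,meat' and 'cheese,fruit' count as two different strings. A plain-Python sketch of that difference (no Spark involved; the `rows` list is a hypothetical stand-in for the `name` column of the question):

```python
# Hypothetical stand-in for the `name` column of the two rows with ID 1.
rows = ["cheese,meat", "cheese,fruit"]

# Deduplicating whole cells (what collect_set sees without a split):
# the two strings differ, so nothing is merged and "cheese" survives twice.
whole_cells = set(rows)

# Splitting first, then deduplicating the individual tokens.
tokens = {token.strip() for cell in rows for token in cell.split(",")}

print(len(whole_cells))  # 2
print(sorted(tokens))    # ['cheese', 'fruit', 'meat']
```

Sets carry no order, which is also why the Spark output above lists the values in a different order from the desired output.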

mrphzbgm  #2

You can use array_distinct to remove the duplicates that remain after collect_set:

from pyspark.sql import functions as F

df1 = df.groupBy("ID", "start date").agg(
    F.concat_ws(",", F.collect_set("name")).alias("name"),
    F.concat_ws(",", F.collect_set("type")).alias("type"),
).select(
    "ID",
    "start date",
    F.array_join(F.array_distinct(F.split("name", ",")), ",").alias("name"),
    F.array_join(F.array_distinct(F.split("type", ",")), ",").alias("type")
)

df1.show()

# +---+----------+-----------------+-------+
# | ID|start date|             name|   type|
# +---+----------+-----------------+-------+
# |  1|2020/01/01|cheese,fruit,meat|A, C, B|
# +---+----------+-----------------+-------+

Another way is to use regexp_replace to remove the duplicates:

df1 = df.groupBy("ID", "start date").agg(
    F.concat_ws(",", F.collect_set("name")).alias("name"),
    F.concat_ws(",", F.collect_set("type")).alias("type"),
).select(
    "ID",
    "start date",
    F.regexp_replace("name", r"\b(\w+)\b\s*,\s*(?=.*\1)", "").alias("name"),
    F.regexp_replace("type", r"\b(\w+)\b\s*,\s*(?=.*\1)", "").alias("type")
)
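
To see what this pattern does, it can be tried outside Spark. The sketch below uses Python's `re` module as a stand-in for the Java regex engine that Spark uses; that the two behave the same for this pattern is an assumption, and the input strings are made up to mimic the output of concat_ws. The `strict` variant with word boundaries around the backreference is not part of the answer; it is shown only to illustrate a caveat.

```python
import re

# Same pattern as in the answer: drop "token," when the token text occurs again later.
pattern = r"\b(\w+)\b\s*,\s*(?=.*\1)"

joined = "cheese,meat,cheese,fruit"  # made-up concat_ws output
deduped = re.sub(pattern, "", joined)
print(deduped)  # meat,cheese,fruit

# Caveat: the lookahead matches the token as a substring, so "meat" is
# dropped because "meatball" follows, although they are different values.
print(re.sub(pattern, "", "meat,meatball"))  # meatball

# Hypothetical stricter variant: require the later occurrence to be a whole word.
strict = r"\b(\w+)\b\s*,\s*(?=.*\b\1\b)"
print(re.sub(strict, "", "meat,meatball"))  # meat,meatball
```

Note that the pattern keeps the last occurrence of each value, so the surviving order follows last appearances rather than first ones.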
suzh9iv8  #3

You can use:

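from pyspark.sql import functions as F

# Assumes the date column is named start_date (no space), as in the result below.
# Split each cell into an array, collect the arrays per group, flatten them
# into a single array, and drop the duplicates before joining with commas.
df.select(
    df.ID,
    df.start_date,
    F.split(df.name, ',').alias('name'),
    F.split(df.type, ',').alias('type')
).groupby('ID', 'start_date').agg(
    F.concat_ws(',', F.array_distinct(F.flatten(F.collect_list('name')))).alias('name'),
    F.concat_ws(',', F.array_distinct(F.flatten(F.collect_list('type')))).alias('type')
)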

Result:

+---+----------+-----------------+-----+
| ID|start_date|             name| type|
+---+----------+-----------------+-----+
|  1|2020/01/01|cheese,meat,fruit|A,B,C|
+---+----------+-----------------+-----+
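
The steps in this answer are split, collect_list, flatten, array_distinct and concat_ws. A rough plain-Python analogue of that pipeline is sketched below (not Spark code; the helper name `merge_distinct` is hypothetical). It assumes duplicates are dropped while keeping the first occurrence of each value, and it strips whitespace around tokens, which `F.split(col, ',')` alone would not do.

```python
from itertools import chain


def merge_distinct(cells, sep=","):
    """Split each cell, flatten the pieces, and keep first occurrences only."""
    # split: one list of tokens per row, like F.split on each cell
    # (strip() is an addition here so that 'A, B' yields 'A' and 'B')
    split_rows = [[t.strip() for t in cell.split(sep)] for cell in cells]
    # flatten: a single sequence per group, like F.flatten(F.collect_list(...))
    flat = chain.from_iterable(split_rows)
    # distinct, keeping first-seen order (dict keys preserve insertion order)
    return sep.join(dict.fromkeys(flat))


print(merge_distinct(["cheese,meat", "cheese,fruit"]))  # cheese,meat,fruit
print(merge_distinct(["A, B", "A, C"]))                 # A,B,C
```

In Spark itself the order in which rows reach collect_list is not guaranteed after a shuffle, so the order of values in the result may differ between runs.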
