PySpark: create rows for 0 values when aggregating all combinations of several columns

Posted by 4zcjmb1e on 2022-11-01 in Spark


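Using the example from this question, how can I create rows with a count of 0 when aggregating over all possible combinations? With cube, rows whose count would be 0 are not produced.
Here is the code and its output: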

df.cube($"x", $"y").count.show

// +----+----+-----+     
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+

But this is the desired output (note the added 4th row).

// +----+----+-----+     
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   1|    0|   <- count of records where x = bar AND y = 1
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+

Is there another function that can do this?

pyspark

Source: https://stackoverflow.com/questions/73833029/create-rows-for-0-values-when-aggregating-all-combinations-of-several-columns

2 answers


ldioqlga1#

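I agree that crossJoin is the right approach here. However, I think that following it with a join, rather than a union and a groupBy, may be more general, especially when there are several aggregations.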

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('foo', 1),
     ('foo', 2),
     ('bar', 2),
     ('bar', 2)],
    ['x', 'y'])

df_cartesian = df.select('x').distinct().crossJoin(df.select("y").distinct())
df_cubed = df.cube('x', 'y').count()
df_cubed.join(df_cartesian, ['x', 'y'], 'full').fillna(0, ['count']).show()

# +----+----+-----+
# |   x|   y|count|
# +----+----+-----+
# |null|null|    4|
# |null|   1|    1|
# |null|   2|    3|
# | bar|null|    2|
# | bar|   1|    0|
# | bar|   2|    2|
# | foo|null|    2|
# | foo|   1|    1|
# | foo|   2|    1|
# +----+----+-----+

2022-11-01

kwvwclae2#

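First, let's look at why you do not get the combinations that do not appear in your dataset.
cube
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
As the doc says, cube is just a fancy group by. You can also check this by running explain on the result. You will see that cube is basically an expand (to obtain the nulls) followed by a group by. It therefore cannot show combinations that are not in the dataset. That requires a join, so that values that never appear together in the same record can "meet".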
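The "expand, then group by" behaviour described above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not Spark code: the `expand` helper and the use of `None` in place of null are assumptions made for the example, and the records are the four rows from the question.

```python
from collections import Counter

# Same four records as the question's example: (x, y).
rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]

def expand(row):
    """Emit the four grouping variants of one record; None plays the role of null."""
    x, y = row
    return [(x, y), (x, None), (None, y), (None, None)]

# "Expand" every record, then "group by" the expanded keys and count them.
cube_counts = Counter(key for row in rows for key in expand(row))

for key, count in sorted(cube_counts.items(), key=lambda kv: str(kv[0])):
    print(key, count)

# ("bar", 1) never occurs in any record, so no expanded row carries that key.
print(("bar", 1) in cube_counts)  # False
```

Every key in the result originates from an existing record, which is why a combination that is absent from the data cannot come out of this step alone.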
So let's build a solution:

import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._ // needed for toDF; already in scope in spark-shell

// A dataset in which (2, 1) does not exist.
// It must be defined before the Cartesian product that is built from it.
val df = Seq((1, 1), (1, 2), (2, 2)).toDF("x", "y")

// This contains one row per possible combination, even those that are not in the dataset.
// Note that we set the count to 0.
val cartesian = df
    .select("x").distinct
    .crossJoin(df.select("y").distinct)
    .withColumn("count", lit(0))

// Let's now union the cube with the Cartesian product (CP) and
// perform a group by again.
// Since the counts were set to zero in the CP, this will not impact the
// counts of the cube. It simply adds "missing" values with a count of 0.
df.cube("x", "y").count
    .union(cartesian)
    .groupBy("x", "y")
    .agg(sum("count") as "count")
    .show

This yields:

+----+----+-----+
|   x|   y|count|
+----+----+-----+
|   2|   2|    1|
|   1|   2|    1|
|   1|   1|    1|
|   2|   1|    0|
|null|null|    3|
|   1|null|    2|
|null|   1|    1|
|null|   2|    2|
|   2|null|    1|
+----+----+-----+
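To see why adding zero-count rows and summing again leaves the cube's counts untouched, here is a small plain-Python model of the same idea. It is a sketch under stated assumptions rather than Spark code: `None` stands in for null, dictionaries stand in for DataFrames, and the names `cartesian` and `filled` are chosen only for this illustration.

```python
from collections import Counter
from itertools import product

# The answer's dataset: the pair (2, 1) never occurs.
rows = [(1, 1), (1, 2), (2, 2)]

# Cube counts, modeled as expand + group by (None stands in for null).
cube_counts = Counter(
    key
    for x, y in rows
    for key in [(x, y), (x, None), (None, y), (None, None)]
)

# Cartesian product of the distinct values, each with a count of 0.
xs = sorted({x for x, _ in rows})
ys = sorted({y for _, y in rows})
cartesian = {(x, y): 0 for x, y in product(xs, ys)}

# "Union" then "group by + sum": adding 0 leaves existing counts unchanged
# and introduces the missing keys with a count of 0.
filled = dict(cube_counts)
for key, zero in cartesian.items():
    filled[key] = filled.get(key, 0) + zero

print(filled[(2, 1)])        # 0
print(filled[(None, None)])  # 3
```

The same reasoning applies to any additive aggregate: a neutral value contributed by the Cartesian product does not change groups that already exist.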

2022-11-01
