PySpark: filter rows with the minimum and maximum count

sq1bmfud asked on 2022-11-01 in Spark

The Dataframe looks like this:

+---+-----------------------------------------+-----+
|eco|eco_name                                 |count|
+---+-----------------------------------------+-----+
|B63|Sicilian, Richter-Rauzer Attack          |5    |
|D86|Grunfeld, Exchange                       |3    |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd     |5    |
|A44|Old Benoni Defense                       |3    |
|C46|Three Knights                            |1    |
|C08|French, Tarrasch, Open, 4.ed ed          |13   |
|E59|Nimzo-Indian, 4.e3, Main line            |2    |
|A20|English                                  |2    |
|B20|Sicilian                                 |4    |
|B37|Sicilian, Accelerated Fianchetto         |2    |
|A33|English, Symmetrical                     |8    |
|C77|Ruy Lopez                                |8    |
|B43|Sicilian, Kan, 5.Nc3                     |10   |
|A04|Reti Opening                             |6    |
|A59|Benko Gambit                             |1    |
|A54|Old Indian, Ukrainian Variation, 4.Nf3   |3    |
|D30|Queen's Gambit Declined                  |19   |
|C01|French, Exchange                         |3    |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1    |
|E74|King's Indian, Averbakh, 6...c5          |2    |
+---+-----------------------------------------+-----+

Schema:

root
 |-- eco: string (nullable = true)
 |-- eco_name: string (nullable = true)
 |-- count: long (nullable = false)

I want to filter it so that only two rows remain: the one with the minimum count and the one with the maximum count.
The output Dataframe should look like:

+---+-----------------------------------------+--------------------+
|eco|eco_name                                 |number_of_occurences|
+---+-----------------------------------------+--------------------+
|D30|Queen's Gambit Declined                  |19                  |
|C46|Three Knights                            |1                   |
+---+-----------------------------------------+--------------------+

I'm a beginner, so I'm really sorry if this is a silly question.

but5z9lq

No need to apologize, this is the place to learn! One solution is to use rank over a Window to find the min/max rows:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('a', 1), ('b', 1), ('c', 2), ('d', 3)],
    schema=['col1', 'col2']
)

df.show(10, False)
+----+----+
|col1|col2|
+----+----+
|a   |1   |
|b   |1   |
|c   |2   |
|d   |3   |
+----+----+

Then simply filter on the ranks to keep the rows with the min/max count:

df\
    .withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
    .withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
    .filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
    .show(100, False)
+----+----+-------+-------+
|col1|col2|min_row|max_row|
+----+----+-------+-------+
|d   |3   |4      |1      |
|a   |1   |1      |3      |
|b   |1   |1      |3      |
+----+----+-------+-------+

Note that rank assigns the same value to tied counts (rows a and b above), so every row tied at the minimum or maximum count will pass the filter.

eqfvzcg8

You can use the row_number function twice to order the records by count, once ascending and once descending:

SELECT eco, eco_name, count
  FROM (SELECT *,
               row_number() over (order by count asc) as rna,
               row_number() over (order by count desc) as rnd
          FROM df)
 WHERE rna = 1 or rnd = 1;

Note that there is a tie at count = 1. If you care about it, you can add a secondary sort key to control which record gets picked, or use rank instead to select all tied records.
