PySpark: drop duplicates and keep the rows with the highest value in a column

Posted by gxwragnw on 2022-12-28 in Python


I have the following Spark dataset:

id    col1    col2    col3    col4
1      1        5       2      3
1      1        0       2      3
2      3        1       7      7
3      6        1       3      3
3      6        5       3      3

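I want to drop duplicates over the column subset ['id', 'col1', 'col3', 'col4'] and, among each set of duplicates, keep the row with the highest value in col2. The result should look like this: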

id    col1    col2    col3    col4
1      1        5       2      3
2      3        1       7      7
3      6        5       3      3

How can I do this in PySpark?


Source: https://stackoverflow.com/questions/74936051/pyspark-drop-duplicates-and-keep-rows-with-highest-value-in-a-column

3 answers

mzsu5hc01#

Group by the key columns and take the maximum of col2:

df = df.groupby(['id','col1','col3','col4']).max('col2')

2022-12-28

fjnneemd2#

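Another approach is to compute the maximum over a window and filter for rows where col2 equals that maximum. This keeps every row that satisfies the condition, so rows that tie on the maximum are all retained.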

from pyspark.sql import Window, functions as F

df.withColumn('max', F.max('col2').over(Window.partitionBy('id', 'col1', 'col3', 'col4'))).where(F.col('col2') == F.col('max')).show()

2022-12-28

gkl3eglg3#

If you are more familiar with SQL syntax than with the PySpark DataFrame API, you can do it like this.
Create the DataFrame (optional, since you already have the data):

from pyspark.sql.types import StructType,StructField, IntegerType

data = [
  (1,      1,        5,       2,      3),
  (1,      1,        0,       2,      3),
  (2,      3,        1,       7,      7),
  (3,      6,        1,       3,      3),
  (3,      6,        5,       3,      3),
]

schema = StructType([
    StructField("id", IntegerType()),
    StructField("col1", IntegerType()),
    StructField("col2", IntegerType()),
    StructField("col3", IntegerType()),
    StructField("col4", IntegerType()),
])

df = spark.createDataFrame(data=data,schema=schema)
df.show()

Then create a view of the DataFrame so that you can run SQL queries against it. The following creates a new temporary view named "tbl".

# create view from df called "tbl"
df.createOrReplaceTempView("tbl")

Finally, write a SQL query against the view. Here we group by id, col1, col3 and col4 and select the maximum value of col2 for each group.

# query to group by id,col1,col3,col4 and select max col2
my_query = """
select 
  id, col1, max(col2) as col2, col3, col4
from tbl
group by id, col1, col3, col4
"""

new_df = spark.sql(my_query)
new_df.show()

Final output:

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  1|   1|   5|   2|   3|
|  2|   3|   1|   7|   7|
|  3|   6|   5|   3|   3|
+---+----+----+----+----+

2022-12-28
