python-3.x: How to filter a DataFrame based on a specific condition?

kknvjkwl  posted on 2023-01-22  in  Python

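**Scenario:** If a number has two records, one with reason del and the other with reason undel, select only the del record. If a number has only one record, whether its reason is undel or del, select that record as well.
**Example:** I have a DataFrame with 2 columns, as shown below: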

| number | reason |
| ------ | ------ |
| 1234 | del |
| 1234 | undel |
| 4567 | del |
| 6789 | undel |
| 2423 | del |
| 2423 | undel |

My expected output is as follows:

| number | reason | tofilter |
| ------ | ------ | ------ |
| 1234 | del | 1 |
| 1234 | undel | 0 |
| 4567 | del | 1 |
| 6789 | undel | 1 |
| 2423 | del | 1 |
| 2423 | undel | 0 |

Here I want to keep only the records where tofilter is 1.

xeufq47z1#

You can produce the expected output from the DataFrame you provided, assuming that any number that repeats has exactly two rows.

import pandas as pd

data = {'number': ['1234', '1234', '4567', '6789', '2423', '2423'],
        'reason': ['del', 'undel', 'del', 'undel', 'del', 'undel']}
df = pd.DataFrame(data, columns=['number', 'reason'])

# Number of rows for each number
unique_vals = df.number.value_counts().to_dict()

# Create the column up front (as NaN) so the single-row branch never hits a missing column
df['to_filter'] = float('nan')

for nbr, filter_check in unique_vals.items():
    indexes = df.index[df['number'] == nbr].to_list()
    if filter_check == 2:
        # First row of the pair gets 1, second gets 0 (assumes del is listed before undel)
        df.loc[indexes[0], 'to_filter'] = 1
        df.loc[indexes[1], 'to_filter'] = 0
    elif filter_check == 1:
        # A number with a single row is always selected
        df.loc[indexes[0], 'to_filter'] = 1

print(df)

Output:

  number reason  to_filter
0   1234    del        1.0
1   1234  undel        0.0
2   4567    del        1.0
3   6789  undel        1.0
4   2423    del        1.0
5   2423  undel        0.0
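
The same rule can also be written without a loop and without relying on row order. The following is a minimal sketch, not the answerer's method: it assumes the reasons are exactly the strings `del` and `undel`, and that a number has at most one row per reason. The `group_size` and `selected` names are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "number": ["1234", "1234", "4567", "6789", "2423", "2423"],
    "reason": ["del", "undel", "del", "undel", "del", "undel"],
})

# Rows per number, broadcast back onto every row of that number
group_size = df.groupby("number")["reason"].transform("size")

# Keep a row if it is a del, or if it is the only row for its number
df["to_filter"] = ((df["reason"] == "del") | (group_size == 1)).astype(int)

# The actual filter: only rows flagged with 1
selected = df[df["to_filter"] == 1]
print(selected)
```

Because the flag is computed from the reason itself, it gives the same result if an undel row happens to appear before its matching del row.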

hc8w905p2#

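Here is my attempt using a window function.
It only works for the scenario described, where a number has at most 2 records, one with del and the other with undel. If there are duplicate records, it needs to be adjusted.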

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

# Reuse the active session, or start one when running outside a notebook or shell
spark = SparkSession.builder.getOrCreate()

x = [
    (1234, "del"),
    (1234, "undel"),
    (4567, "del"),
    (6789, "undel"),
    (2423, "del"),
    (2423, "undel"),
]
df = spark.createDataFrame(x, schema=["number", "reason"])

# Within each number, sort by reason: "del" sorts before "undel", so del gets row_number 1
window = Window.partitionBy("number").orderBy(F.col("reason").asc())
dfWithRowNumber = df.withColumn("row_number", F.row_number().over(window))
# Flag the first row of each number with 1 and every other row with 0
dfWithToFilterColumn = dfWithRowNumber.withColumn(
    "toFilter", F.when(F.col("row_number") == F.lit(1), F.lit(1)).otherwise(F.lit(0))
).drop("row_number")

dfWithToFilterColumn.show()

The output is:

+------+------+--------+
|number|reason|toFilter|
+------+------+--------+
|  1234|   del|       1|
|  1234| undel|       0|
|  2423|   del|       1|
|  2423| undel|       0|
|  4567|   del|       1|
|  6789| undel|       1|
+------+------+--------+
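
For the duplicate-record case mentioned above, the intended rule can be pinned down in plain Python and used to check a Spark result on a small sample. This is only a sketch under assumptions: "duplicate" is taken to mean the same (number, reason) pair appearing more than once, exactly one row per number is wanted, and the only reasons are `del` and `undel`. The `select_records` helper is hypothetical and not part of the original answer.

```python
from collections import defaultdict


def select_records(rows):
    """Return one (number, reason) pair per number, tolerating repeated rows."""
    reasons_by_number = defaultdict(set)
    for number, reason in rows:
        # A set collapses exact duplicates of the same reason
        reasons_by_number[number].add(reason)

    selected = []
    for number, reasons in reasons_by_number.items():
        # Prefer del when the number has one; otherwise keep its only reason
        kept = "del" if "del" in reasons else next(iter(reasons))
        selected.append((number, kept))
    return selected


rows = [
    (1234, "del"), (1234, "undel"), (1234, "del"),  # repeated del row
    (4567, "del"),
    (6789, "undel"), (6789, "undel"),               # repeated undel row
    (2423, "del"), (2423, "undel"),
]
print(select_records(rows))
```

In Spark, one way to get the same effect would be to remove exact duplicates before applying the window, though whether that matches the real data depends on what the duplicates look like.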

dbf7pr2w3#

Window functions are your friend. If you want to filter, do the following:

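from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, monotonically_increasing_id, when

# Relies on input order: del must precede undel within each number
(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(monotonically_increasing_id())))  # number the rows within each group
 .where(col('x') == 1)  # keep only the first row of each group
 .drop('x')  # drop the helper column
).show()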

+------+------+
|number|reason|
+------+------+
|  1234|   del|
|  2423|   del|
|  4567|   del|
|  6789| undel|
+------+------+

If you want to show 0 and 1 instead, do the following:

(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(monotonically_increasing_id())))  # number the rows within each group, in input order
 .withColumn('x', when(col('x') == 1, 1).otherwise(0))  # 1 for the first row of each group, 0 for the rest
).show()

+------+------+---+
|number|reason|  x|
+------+------+---+
|  1234|   del|  1|
|  1234| undel|  0|
|  2423|   del|  1|
|  2423| undel|  0|
|  4567|   del|  1|
|  6789| undel|  1|
+------+------+---+
