Pandas: drop value in a group if values are multiple

Posted by laawzig2 on 2022-11-20 in Other


I have a DataFrame with an id column and a quantity column (which can be 0 or 1).

import pandas as pd

df = pd.DataFrame([
{'id': 'thing 1', 'date': '2016-01-01', 'quantity': 0 },
  {'id': 'thing 1', 'date': '2016-02-01', 'quantity': 0 },
  {'id': 'thing 1', 'date': '2016-09-01', 'quantity': 1 },
  {'id': 'thing 1', 'date': '2016-10-01', 'quantity': 1 },
  {'id': 'thing 2', 'date': '2017-01-01', 'quantity': 1 },
  {'id': 'thing 2', 'date': '2017-02-01', 'quantity': 1 },
  {'id': 'thing 2', 'date': '2017-02-11', 'quantity': 1 },
  {'id': 'thing 3', 'date': '2017-09-01', 'quantity': 0 },
  {'id': 'thing 3', 'date': '2017-10-01', 'quantity': 0 },
])
df.date = pd.to_datetime(df.date, format="%Y-%m-%d")
df

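For a given id, if I have both 0s and 1s, I want to return only the 1s; if there are only 1s, I want to return all the 1s; if there are only 0s, I want to return all the 0s.
My approach is to apply a function to each group and then reset the index: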

def drop_that(dff):
    q = len(dff[dff['quantity']==1])
    if q >0:
        return dff[dff['quantity']==1]
    else:
        return dff
    
dfg = df.groupby('id', as_index=False).apply(drop_that)
dfg.reset_index(drop=True)

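However, I only got this working through brute-force Googling, and I don't really know whether it is good Pandas practice or whether there is an alternative that would perform better.
Any advice would be appreciated.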

pandas

Source: https://stackoverflow.com/questions/66157537/pandas-drop-value-in-a-group-if-values-are-multiple

4 Answers


laximzn51#

You can try:

# find the number of unique quantity for each thing
s = df.groupby('id')['quantity'].transform('nunique')

df[s.eq(1)                 # things with only 1 quantity value (either 0 or 1)
   | df['quantity'].eq(1)  # or quantity==1 when there are 2 values
  ]

Output:

        id       date  quantity
2  thing 1 2016-09-01         1
3  thing 1 2016-10-01         1
4  thing 2 2017-01-01         1
5  thing 2 2017-02-01         1
6  thing 2 2017-02-11         1
7  thing 3 2017-09-01         0
8  thing 3 2017-10-01         0
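
To make the intermediate step visible, here is a minimal self-contained sketch of the same mask, assuming a reduced copy of the question's data (the date column is left out, and the names n_distinct, mask and result are only illustrative):

```python
import pandas as pd

# Reduced copy of the question's data (date column omitted for brevity).
df = pd.DataFrame({
    "id": ["thing 1"] * 4 + ["thing 2"] * 3 + ["thing 3"] * 2,
    "quantity": [0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Number of distinct quantity values per id, broadcast back to every row.
n_distinct = df.groupby("id")["quantity"].transform("nunique")

# Keep a row if its group is uniform (all 0 or all 1) or the row itself is 1.
mask = n_distinct.eq(1) | df["quantity"].eq(1)
result = df[mask]

print(n_distinct.tolist())    # [2, 2, 2, 2, 1, 1, 1, 1, 1]
print(result.index.tolist())  # [2, 3, 4, 5, 6, 7, 8]
```

Because the mask is aligned with df, the original row labels are preserved and no reset_index is needed unless a fresh 0..n-1 index is wanted.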


izj3ouym2#

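Following your logic, try transform with max: a row should be kept if its quantity equals the maximum of its group.

# Logic: if a group has only 0s or only 1s, its max is 0 or 1 and every row matches it;
#        if a group has both 0 and 1, its max is 1, so only the rows equal to 1 are kept.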

out = df[df.quantity.eq(df.groupby('id')['quantity'].transform('max'))]
Out[89]: 
        id       date  quantity
2  thing 1 2016-09-01         1
3  thing 1 2016-10-01         1
4  thing 2 2017-01-01         1
5  thing 2 2017-02-01         1
6  thing 2 2017-02-11         1
7  thing 3 2017-09-01         0
8  thing 3 2017-10-01         0
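
As a sanity check, the max-based filter can be compared with the question's rule written out group by group. This is only a sketch on a reduced copy of the data (no date column); the explicit loop and the names fast and slow are illustrative, not part of the answer above. Note that the shortcut relies on quantity taking only the values 0 and 1.

```python
import pandas as pd

# Reduced copy of the question's data (date column omitted for brevity).
df = pd.DataFrame({
    "id": ["thing 1"] * 4 + ["thing 2"] * 3 + ["thing 3"] * 2,
    "quantity": [0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Vectorized version: keep rows whose quantity equals the group maximum.
group_max = df.groupby("id")["quantity"].transform("max")
fast = df[df["quantity"].eq(group_max)]

# Reference version: the question's rule spelled out one group at a time.
pieces = []
for _, g in df.groupby("id"):
    ones = g[g["quantity"] == 1]
    pieces.append(ones if len(ones) > 0 else g)
slow = pd.concat(pieces)

print(fast.equals(slow))  # True
```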


oxosxuxt3#

Another solution, perhaps closer to natural language, is:

(
    df
    .groupby("id")
    .apply(lambda x: x if x.quantity.unique().size == 1 
                       else x.query("quantity == 1"))
    .reset_index(drop=True)
)

Output:

#   id       date        quantity
# 0 thing 1  2016-09-01  1
# 1 thing 1  2016-10-01  1
# 2 thing 2  2017-01-01  1
# 3 thing 2  2017-02-01  1
# 4 thing 2  2017-02-11  1
# 5 thing 3  2017-09-01  0
# 6 thing 3  2017-10-01  0
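
If the original row labels should be kept, the same per-group rule can be expressed as a boolean mask with transform instead of apply. The following is a hedged sketch on a reduced copy of the data; the lambda and the name keep are assumptions for illustration, and the explicit astype(bool) is there only to make sure the result can be used for boolean indexing.

```python
import pandas as pd

# Reduced copy of the question's data (date column omitted for brevity).
df = pd.DataFrame({
    "id": ["thing 1"] * 4 + ["thing 2"] * 3 + ["thing 3"] * 2,
    "quantity": [0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Per group: keep rows equal to 1, or every row when the group has one distinct value.
keep = (
    df.groupby("id")["quantity"]
    .transform(lambda q: (q == 1) | (q.nunique() == 1))
    .astype(bool)
)
result = df[keep]

print(result.index.tolist())  # [2, 3, 4, 5, 6, 7, 8]
```

A Python-level lambda runs once per group, so for many groups the built-in reductions ('nunique', 'max') shown in the other answers are likely to be faster.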


iyfjxgzm4#

Here is an approach using rank:

df.loc[df.groupby('id')['quantity'].rank(method = 'dense',ascending=False).eq(1)]

Output:

        id       date  quantity
2  thing 1 2016-09-01         1
3  thing 1 2016-10-01         1
4  thing 2 2017-01-01         1
5  thing 2 2017-02-01         1
6  thing 2 2017-02-11         1
7  thing 3 2017-09-01         0
8  thing 3 2017-10-01         0
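
To see why this works, it can help to print the ranks themselves. A minimal sketch on a reduced copy of the data (the variable names are illustrative): with a descending dense rank, the largest quantity in each group always gets rank 1, whether that value is 0 or 1.

```python
import pandas as pd

# Reduced copy of the question's data (date column omitted for brevity).
df = pd.DataFrame({
    "id": ["thing 1"] * 4 + ["thing 2"] * 3 + ["thing 3"] * 2,
    "quantity": [0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Descending dense rank within each id: the group's largest value gets rank 1.
ranks = df.groupby("id")["quantity"].rank(method="dense", ascending=False)
result = df.loc[ranks.eq(1)]

print(ranks.tolist())         # [2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(result.index.tolist())  # [2, 3, 4, 5, 6, 7, 8]
```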

