pandas: replace duplicated sentences with the word "same"

Posted by wooyq4lh on 2022-12-21 in Other


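I want to replace duplicated comments with the word "same", keeping the original comment for the first occurrence and updating the ID as shown below. However, some comments do not match exactly, such as the last three.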

import pandas as pd

data = {'Key': ['111', '111', '111', '222*1', '222*2', '333*1', '333*2', '333*3', '444', '444', '444'],
        'id': ['', '', '', '1', '2', '1', '2', '3', '', '', ''],
        'comment': ['wrong sentence', 'wrong sentence', 'wrong sentence', 'M', 'M', 'F', 'F', 'F',
                    'wrong sentence used in the topic', 'wrong sentence used', 'wrong sentence use']}

# Create DataFrame
df = pd.DataFrame(data)

print(df)


Source: https://stackoverflow.com/questions/74788849/replace-the-duplicated-sentences-with-word-same

2 Answers

8aqjt8rx1#

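# Boolean mask: True where the comment contains the text 'wrong sentence'
ind = df['comment'].str.contains('wrong sentence')

def my_func(x):
    # x holds the rows of one Key. Act only on groups with more than one row
    # whose first comment is longer than one character and matches the mask.
    if len(x['comment'].values[0]) > 1 and len(x) > 1 and ind[x.index[0]]:
        # Keep the first comment and overwrite the remaining ones with 'same'
        df.loc[x.index[1:], 'comment'] = 'same'
        # Number the rows of the group from 1 to len(x)
        df.loc[x.index, 'id'] = range(1, len(x)+1)

# my_func modifies df in place; the value returned by apply is not used
df.groupby('Key').apply(my_func)

print(df)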

Output:

      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same

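Here, str.contains is used to match the text 'wrong sentence', which yields a boolean mask.
groupby is applied to the 'Key' column, and each group is passed to the user-defined function my_func. The condition checks three things: the first comment in the group is longer than one character, the group has more than one row, and the first comment matches 'wrong sentence'.
loc is then used to overwrite the values.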
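The same three checks can also be expressed without a function that mutates df from inside apply. The following is only a sketch under stated assumptions, not the answerer's code: the names has_phrase, first_has_phrase, group_size, position and target are made up for illustration, the data is a shortened copy of the question's frame, and the counter is written back as strings so that the 'id' column keeps a single type. The length check is dropped because a comment that contains 'wrong sentence' is necessarily longer than one character.

```python
import pandas as pd

# Shortened copy of the question's data (illustrative only)
df = pd.DataFrame({
    "Key": ["111", "111", "111", "222*1", "222*2"],
    "id": ["", "", "", "1", "2"],
    "comment": ["wrong sentence", "wrong sentence", "wrong sentence", "M", "M"],
})

# True where the comment contains the literal text 'wrong sentence'
has_phrase = df["comment"].str.contains("wrong sentence", regex=False)

# Broadcast the first row's match result to every row of its Key group
first_has_phrase = has_phrase.groupby(df["Key"]).transform("first").astype(bool)

# Number of rows in each Key group, and 1-based position inside the group
group_size = df.groupby("Key")["Key"].transform("size")
position = df.groupby("Key").cumcount() + 1

# Rows that belong to a multi-row group whose first comment matches
target = first_has_phrase & (group_size > 1)

# Write the counter as strings (an assumption, to match the existing ids)
df.loc[target, "id"] = position[target].astype(str)
# Every row after the first one in a targeted group becomes 'same'
df.loc[target & (position > 1), "comment"] = "same"

print(df)
```

Because nothing is changed while groupby is iterating, the result does not depend on how often apply calls the function.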

Update:

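def my_func(x):
    # Count of the most frequent 10-character prefix among the group's comments
    unic = x['comment'].str.slice(start=0, stop=10).value_counts().values[0]
    # Number of rows in the group
    clv = len(x)
    # Act only if the group has several rows that all share the same first 10 characters
    if len(x['comment'].values[0]) > 1 and clv > 1 and unic == clv:
        df.loc[x.index[1:], 'comment'] = 'same'
        df.loc[x.index, 'id'] = range(1, clv+1)

# my_func modifies df in place; the value returned by apply is not used
df.groupby('Key').apply(my_func)

print(df)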
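To see why the updated check also catches the near-duplicates under Key 444, it may help to look at the prefix comparison in isolation. This is a small illustrative sketch; the variable names comments, prefixes, top_count and all_share_prefix are assumptions and do not appear in the answer.

```python
import pandas as pd

# The three comments stored under Key 444 in the question
comments = pd.Series([
    "wrong sentence used in the topic",
    "wrong sentence used",
    "wrong sentence use",
])

# Keep only the first 10 characters of each comment
prefixes = comments.str.slice(start=0, stop=10)

# value_counts sorts by frequency, so the first value is the largest count
top_count = prefixes.value_counts().values[0]

# The group counts as "all the same" when that count equals the group size
all_share_prefix = bool(top_count == len(comments))

print(prefixes.tolist())
print(all_share_prefix)
```

The cut-off of 10 characters is a heuristic taken from the answer: comments that differ within their first 10 characters are treated as different, and comments that only share those characters are treated as duplicates.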

Posted 2022-12-21

rt4zxlrg2#

Use:

# Flag rows whose first 10 characters are duplicated, excluding the values 'M' and 'F'
m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M','F'])
# Build consecutive groups for the flagged rows only and number the rows within each group
counter = df.groupby((~m).cumsum().where(m)).cumcount().add(1)

# Assign the counter to the flagged rows only
df.loc[m, 'id'] = counter[m]

# Replace duplicates with 'same', i.e. flagged rows whose counter is greater than 1
df.loc[counter.gt(1) & m, 'comment'] = 'same'
print(df)
      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same

If needed, the duplicates can also be counted separately within each Key group:

m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M','F'])
counter = df.groupby(['Key',(~m).cumsum().where(m)]).cumcount().add(1)

df.loc[m, 'id'] = counter[m]
df.loc[counter.gt(1) & m, 'comment'] = 'same'
print(df)
      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same
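The grouping key (~m).cumsum().where(m) used in both variants is compact, so here is a hedged, stand-alone sketch of what it computes. The mask values are invented for illustration and run_id is a hypothetical name; only the pandas calls themselves come from the answer.

```python
import pandas as pd

# Invented mask: two runs of flagged rows separated by unflagged rows
m = pd.Series([True, True, False, False, True, True, True])

# Each unflagged row increases the running total, so every consecutive
# run of flagged rows shares one value; where(m) blanks out the
# unflagged rows with NaN so they do not join any run.
run_id = (~m).cumsum().where(m)

# Number the rows inside each run, starting at 1
counter = run_id.groupby(run_id).cumcount().add(1)

print(run_id[m].tolist())
print(counter[m].tolist())
```

Only the values at flagged positions are meaningful, which is why the answer always indexes the counter with m before assigning it.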

Posted 2022-12-21
