Python: How to match split and unsplit words?

Posted by yqyhoc1h on 2023-01-03 in Python

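I have a DataFrame like the one below, and I want to detect repeated words regardless of whether they are written split ("power down") or unsplit ("powerdown"):
Table A: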

Cat       Comments
Stat A    power down due to electric shock
Stat A    powerdown because short circuit
Stat A    top 10 on re work
Stat A    top10 on rework

I would like to get output like this:

Repeated words= ['Powerdown', 'top10','on','rework']

Does anyone have an idea?

unftdfkk1#

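I'm assuming that the words being in a DataFrame column is not actually relevant to the problem at hand, so I'll transfer them to a list and then search for repeats.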

import pandas as pd

df = pd.DataFrame({"Comments": ["power down due to electric shock", "powerdown because short circuit", "top 10 on re work", "top10 on rework"]})
words = df['Comments'].to_list()

This gives:

['power down due to electric shock',
 'powerdown because short circuit',
 'top 10 on re work',
 'top10 on rework']

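Now we create a new list that accounts for the fact that "top 10" and "top10" should be treated as the same word, by appending every pair of adjacent words joined together: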

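newa = []
for s in words:
    a = s.split()
    # range() is evaluated once, so only the original adjacent pairs are joined
    for i in range(len(a)-1):
        w = a[i]+a[i+1]   # e.g. "power" + "down" -> "powerdown"
        a.append(w)
    newa.append(a)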

which produces:

[['power',
  'down',
  'due',
  'to',
  'electric',
  'shock',
  'powerdown',
  'downdue',
  'dueto',
  'toelectric',
  'electricshock'],...

Finally, we flatten the list and use Counter to find the words that occur more than once:

from collections import Counter
from itertools import chain
wordList = list(chain(*newa))
wordCount = Counter(wordList)
[w for w,c in wordCount.most_common() if c>1]

resulting in

['powerdown', 'on', 'top10', 'rework']
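As a sketch, the steps above could also be wrapped in a single function. The name `repeated_words` is hypothetical and not part of the original answer. One deliberate difference: each token is counted at most once per comment, on the assumption that a word repeated inside a single comment should not be reported as repeated across comments.

```python
from collections import Counter


def repeated_words(comments):
    """Return tokens (words or joined adjacent pairs) found in more than one comment."""
    counts = Counter()
    for comment in comments:
        tokens = comment.split()
        # Join each adjacent pair so that "power down" also yields "powerdown".
        joined = [first + second for first, second in zip(tokens, tokens[1:])]
        # Deduplicate within the comment (order preserved) before counting.
        unique_tokens = list(dict.fromkeys(tokens + joined))
        counts.update(unique_tokens)
    return [token for token, count in counts.most_common() if count > 1]


comments = [
    "power down due to electric shock",
    "powerdown because short circuit",
    "top 10 on re work",
    "top10 on rework",
]
result = repeated_words(comments)
print(result)
```

For the four comments from the question this reports the same four tokens as the list above.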

1sbrub3j2#

Let's try:

# df is the question's DataFrame with the columns `Cat` and `Comments`
words = df['Comments'].str.split(' ').explode()

# join each word with the next word of the same comment, e.g. "power" + "down"
biwords = words + words.groupby(level=0).shift(-1)

# group_keys=False keeps the original row index, which the join on `Cat` below relies on
(pd.concat([words.groupby(level=0, group_keys=False).apply(pd.Series.drop_duplicates),     # remove duplicate words within a comment
            biwords.groupby(level=0, group_keys=False).apply(pd.Series.drop_duplicates)])  # remove duplicate bi-words within a comment
   .dropna()                                   # remove NaN created by shifting
   .to_frame().join(df[['Cat']])               # join with original Cat
   .loc[lambda x: x.duplicated(keep=False)]    # keep rows whose (Comments, Cat) pair occurs more than once
   .groupby('Cat')['Comments'].unique()        # select the unique values within each `Cat`
)

Output:

Cat
Stat A    [powerdown, on, top10, rework]
Name: Comments, dtype: object
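A possible variant, sketched here under the assumption that a token counts as repeated when it occurs in more than one comment of the same `Cat`, avoids `groupby(...).apply` and instead counts distinct comment indices per token. The names `tokens`, `n_comments` and `repeated` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Cat": ["Stat A"] * 4,
    "Comments": [
        "power down due to electric shock",
        "powerdown because short circuit",
        "top 10 on re work",
        "top10 on rework",
    ],
})

# One row per word, indexed by the row number of the comment it came from.
words = df["Comments"].str.split().explode()
# Each word joined with the next word of the same comment; the last word yields NaN.
pairs = (words + words.groupby(level=0).shift(-1)).dropna()

# Long table with the columns "index" (comment row number) and "token".
tokens = pd.concat([words, pairs]).rename("token").reset_index()
tokens["Cat"] = tokens["index"].map(df["Cat"])

# Number of distinct comments in which each token occurs, per category.
n_comments = tokens.groupby(["Cat", "token"])["index"].nunique()
repeated = n_comments[n_comments > 1]
print(repeated)
```

Because the count is taken over distinct comment indices, a word that appears twice within one comment is not treated as repeated.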
