regex: Regular expression conditions in Python

oyt4ldly  posted on 2023-08-08  in Python

I have a DataFrame column, technician_verbatim, that contains text.

data = {'technician_verbatim': ["not leak not secured not fitted not parameter"]}
    
Delta_Claims = pd.DataFrame(data)

I have a variable tmp_list that holds a list of words: tmp_list = ['leak', "secu", "fitt", "para", "secur", "fitte", "param"]
I have another DataFrame, not_no_without_exceptions, which is a bag of words.

data = {'not_exceptions': ["secure","secured","fit","fitted"]}

not_no_without_exceptions = pd.DataFrame(data)


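Now I have written code to remove the words that follow "not" in technician_verbatim, for example [not accepted, not sober, not tight, not loose]. However, if the word after "not" appears in the not_exceptions column, it should not be removed together with "not".
tmp_list is the list of words found after "not" in technician_verbatim. I only pulled the first 4 and 5 letters after "not".
In the example above, "secu", "fitt", "secur", and "fitte" match words in the not_exceptions list, so those words should stay in technician_verbatim. The final output would be: "not secured not fitted"
My expected output: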

data = {'New_technician_verbatim': ["not secured not fitted"]}

Delta_Claims = pd.DataFrame(data)


Here is the code I tried:

import pandas as pd
import re

Delta_Claims = pd.read_csv('C:\\path\\hybrid_iqm - Copy.csv')

not_no_without_exceptions = pd.read_csv('C:\\path\\not_no_without_exceptions.csv')

not_exceptions = not_no_without_exceptions['not_exceptions'].tolist()

tmp_list = ['leak', "secu", "fitt", "para", "secur", "fitte", "param"]

tmp_list = [f'{word.strip()}' for words in tmp_list for word in words.split()]


# Create a copy of the technician_verbatim column to avoid modifying the original DataFrame
tmp_container_verb = Delta_Claims['technician_verbatim'].copy()

def remove_words(row):
    for i in tmp_list:
        if i not in not_exceptions:
            escaped_i = re.escape(i)
            # Updated pattern to match partial words with spaces or word boundaries before and after
            pattern = rf'\bnot\s+{escaped_i}\s*|\b{escaped_i}\s*'
            # pattern = rf'not\s+{escaped_i}\s*|{escaped_i}\s*'

            row = re.sub(pattern, '', row, flags=re.IGNORECASE)
    return row

# Apply the remove_words function to each row in tmp_container_verb
Delta_Claims['tmp_container_verb'] = tmp_container_verb.apply(remove_words)

# Now tmp_container_verb contains the modified technician_verbatim values
print(Delta_Claims)


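I don't understand what is wrong with this line. Why doesn't it remove the partial match ("not parameter")? It only removes "not param" and leaves "eter" behind.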

pattern = rf'\bnot\s+{escaped_i}\s*|\b{escaped_i}\s*'

6tr1vspr #1

I see two options.
Pre-filter tmp_list to drop the words that match the start of any word in not_no_without_exceptions, then build a pattern and replace:

import re

tmp_list2 = [w for w in tmp_list if not
             any(w2.startswith(w) for w2 in
             not_no_without_exceptions['not_exceptions'])]
pattern = fr"not ({'|'.join(map(re.escape, tmp_list2))})\w*\b\s*"
# 'not (leak|para|param)\w*\b\s*'

Delta_Claims['New_technician_verbatim'] = (Delta_Claims['technician_verbatim']
                                           .str.replace(pattern, '', regex=True)
                                          )
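
As a quick check without pandas, the same pre-filter and pattern can be run on the sample sentence from the question with plain `re`. This is only a sketch: the lists are copied from the question, the group is made non-capturing, and the final `strip()` is my own addition to drop the trailing space left by the last removal.

```python
import re

# Sample data copied from the question
text = "not leak not secured not fitted not parameter"
tmp_list = ["leak", "secu", "fitt", "para", "secur", "fitte", "param"]
not_exceptions = ["secure", "secured", "fit", "fitted"]

# Keep only the prefixes that do not start any exception word
tmp_list2 = [w for w in tmp_list
             if not any(exc.startswith(w) for exc in not_exceptions)]
print(tmp_list2)  # ['leak', 'para', 'param']

# "not" + prefix + rest of the word (\w*) + trailing whitespace
pattern = rf"not (?:{'|'.join(map(re.escape, tmp_list2))})\w*\b\s*"
result = re.sub(pattern, "", text).strip()
print(result)  # not secured not fitted
```

Note that the `i not in not_exceptions` test in the question is an exact-match check, so prefixes such as "secu" are never protected by it; the `startswith` filter is what keeps them out of the pattern here.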

Alternatively, skip the word list and simply extractall the words from not_no_without_exceptions (and join them):

import re

pattern = rf'(not (?:{"|".join(map(re.escape, not_no_without_exceptions["not_exceptions"]))})\w*\b)'
# '(not (?:secure|secured|fit|fitted)\w*\b)'

Delta_Claims['New_technician_verbatim'] = (Delta_Claims['technician_verbatim']
                                           .str.extractall(pattern)[0]
                                           .groupby(level=0).agg(' '.join)
                                          )


Output:

                             technician_verbatim New_technician_verbatim
0  not leak not secured not fitted not parameter  not secured not fitted
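
The extract-and-join idea can also be sanity-checked outside pandas with `re.findall`. A minimal sketch, assuming the exception words from the question; the outer capture group is dropped here so that `findall` returns the full matches.

```python
import re

# Sample data copied from the question
text = "not leak not secured not fitted not parameter"
not_exceptions = ["secure", "secured", "fit", "fitted"]

# Keep every "not <exception word...>" and discard everything else
pattern = rf"not (?:{'|'.join(map(re.escape, not_exceptions))})\w*\b"
kept = re.findall(pattern, text)
print(kept)            # ['not secured', 'not fitted']
print(" ".join(kept))  # not secured not fitted
```

In the pandas version, a row that contains no exception word produces no match at all, so I would expect it to come back as NaN after the assignment; worth checking on your own data.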

xmakbtuz #2

Change only this line: pattern = rf'\bnot\s+{escaped_i}\s*|\b{escaped_i}\s*'
to this:

pattern = rf'not\s+{escaped_i}\S*\s*|not\s+{escaped_i}\S*\s*\w+'
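
For anyone wondering why the original pattern left "eter" behind: `\bnot\s+param\s*` stops matching right after the prefix, so the rest of the word is never consumed. Below is a small sketch of the difference on the sample sentence, using a single prefix and `\w*` as one possible way to swallow the remainder (the `\S*` in the line above does the same job for this input).

```python
import re

# Sample sentence from the question, with one prefix for illustration
text = "not leak not secured not fitted not parameter"
prefix = re.escape("param")

# Original pattern: the match ends right after the prefix
original = rf"\bnot\s+{prefix}\s*|\b{prefix}\s*"
after_original = re.sub(original, "", text, flags=re.IGNORECASE)
print(after_original)  # not leak not secured not fitted eter

# Adding \w* consumes the rest of the word as well
fixed = rf"\bnot\s+{prefix}\w*\s*"
after_fixed = re.sub(fixed, "", text, flags=re.IGNORECASE)
print(after_fixed.rstrip())  # not leak not secured not fitted
```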
