Regex condition in Python

Posted by oyt4ldly on 2023-08-08 in Python


I have a DataFrame column, technician_verbatim, that contains text.

import pandas as pd

data = {'technician_verbatim': ["not leak not secured not fitted not parameter"]}

Delta_Claims = pd.DataFrame(data)

I have a variable tmp_list that holds a list of word fragments: tmp_list = ['leak', "secu", "fitt", "para", "secur", "fitte", "param"]
I have another DataFrame, not_no_without_exceptions, which is a bag of words.

data = {'not_exceptions': ["secure","secured","fit","fitted"]}

not_no_without_exceptions = pd.DataFrame(data)

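I wrote code to remove the words that follow "not" in technician_verbatim, together with the "not" itself, for example [not accepted, not sober, not tight, not loose]. However, if the word after "not" appears in the not_exceptions column, it should not be removed along with "not".
tmp_list is the list of words that occur after "not" in technician_verbatim. I only extracted the first 4 and 5 letters after "not".
So in my example above, "secu", "fitt", "secur", and "fitte" match entries in the not_exceptions list, so those words stay in technician_verbatim. The final output is: "not secured not fitted".
My expected output: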

data = {'New_technician_verbatim': ["not secured not fitted"]}

Delta_Claims = pd.DataFrame(data)

Here is the code I tried:

import pandas as pd
import re

Delta_Claims = pd.read_csv('C:\\path\\hybrid_iqm - Copy.csv')

not_no_without_exceptions = pd.read_csv('C:\\path\\not_no_without_exceptions.csv')

not_exceptions = not_no_without_exceptions['not_exceptions'].tolist()

tmp_list = ['leak', "secu", "fitt", "para", "secur", "fitte", "param"]

tmp_list = [f'{word.strip()}' for words in tmp_list for word in words.split()]


# Create a copy of the technician_verbatim column to avoid modifying the original DataFrame
tmp_container_verb = Delta_Claims['technician_verbatim'].copy()

def remove_words(row):
    for i in tmp_list:
        if i not in not_exceptions:
            escaped_i = re.escape(i)
            # Updated pattern to match partial words with spaces or word boundaries before and after
            pattern = rf'\bnot\s+{escaped_i}\s*|\b{escaped_i}\s*'
            # pattern = rf'not\s+{escaped_i}\s*|{escaped_i}\s*'

            row = re.sub(pattern, '', row, flags=re.IGNORECASE)
    return row

# Apply the remove_words function to each row in tmp_container_verb
Delta_Claims['tmp_container_verb'] = tmp_container_verb.apply(remove_words)

# Now tmp_container_verb contains the modified technician_verbatim values
print(Delta_Claims)

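I don't understand what is wrong with this line. Why does it not remove the partial match ("not parameter") completely? It only removes "not param" and leaves "eter" behind.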

pattern = rf'\bnot\s+{escaped_i}\s*|\b{escaped_i}\s*'


Source: https://stackoverflow.com/questions/76768798/stuck-in-regex-condition-in-python

2 Answers


6tr1vspr #1

I see two options.
Pre-filter tmp_list to drop the fragments that match the start of any word in not_no_without_exceptions, then build a pattern and replace:

import re

tmp_list2 = [w for w in tmp_list if not
             any(w2.startswith(w) for w2 in
             not_no_without_exceptions['not_exceptions'])]
pattern = fr"not ({'|'.join(map(re.escape, tmp_list2))})\w*\b\s*"
# 'not (leak|para|param)\w*\b\s*'

Delta_Claims['New_technician_verbatim'] = (Delta_Claims['technician_verbatim']
                                           .str.replace(pattern, '', regex=True)
                                          )

Alternatively, skip the word list and just extractall the "not" phrases built from the words in not_no_without_exceptions (and join them):

import re

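# Raw f-string: in a non-raw string, '\b' would be a backspace character, not a word boundary
pattern = rf'(not (?:{"|".join(map(re.escape, not_no_without_exceptions["not_exceptions"]))})\w*\b)'
# '(not (?:secure|secured|fit|fitted)\w*\b)'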

Delta_Claims['New_technician_verbatim'] = (Delta_Claims['technician_verbatim']
                                           .str.extractall(pattern)[0]
                                           .groupby(level=0).agg(' '.join)
                                          )

Output:

technician_verbatim New_technician_verbatim
0  not leak not secured not fitted not parameter  not secured not fitted
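
To check the logic outside a DataFrame, the two options above can be sketched with plain `re` on the sample string. This is a minimal sketch, not the answerer's code: the function names `remove_unlisted` and `keep_listed` are hypothetical, and the trailing `.strip()` in the first one is an addition to drop the space left after the last removal.

```python
import re

tmp_list = ['leak', 'secu', 'fitt', 'para', 'secur', 'fitte', 'param']
not_exceptions = ['secure', 'secured', 'fit', 'fitted']
text = "not leak not secured not fitted not parameter"

def remove_unlisted(s):
    """Option 1: delete 'not <word>' unless <word> starts like an exception."""
    # Keep only fragments that are not a prefix of any exception word
    removable = [w for w in tmp_list
                 if not any(exc.startswith(w) for exc in not_exceptions)]
    pattern = rf"not (?:{'|'.join(map(re.escape, removable))})\w*\b\s*"
    return re.sub(pattern, '', s).strip()

def keep_listed(s):
    """Option 2: keep only the 'not <exception>' phrases and join them."""
    pattern = rf"not (?:{'|'.join(map(re.escape, not_exceptions))})\w*\b"
    return ' '.join(re.findall(pattern, s))

print(remove_unlisted(text))  # not secured not fitted
print(keep_listed(text))      # not secured not fitted
```

Either function could also be passed to `Delta_Claims['technician_verbatim'].apply(...)` if the vectorized `.str` methods are not needed.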


xmakbtuz #2

Change only this line: pattern = rf'\bnot\s+{escaped_i}\s*|\b{escaped_i}\s*'
to this:

pattern = rf'not\s+{escaped_i}\S*\s*|not\s+{escaped_i}\S*\s*\w+'
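
The difference between the old and new patterns can be seen on a single string. This is a minimal sketch, assuming the fragment is `param`. It uses only the first alternative of the new pattern, since the second alternative can match only where the first one already matches and is therefore never reached.

```python
import re

text = "not leak not secured not fitted not parameter"
frag = re.escape("param")

# Original pattern: after the fragment only whitespace is consumed,
# so the rest of the word ("eter") is left behind.
old = rf'\bnot\s+{frag}\s*|\b{frag}\s*'
old_result = re.sub(old, '', text, flags=re.IGNORECASE)
print(repr(old_result))  # 'not leak not secured not fitted eter'

# With \S* the remaining non-space characters of the word are consumed too.
new = rf'not\s+{frag}\S*\s*'
new_result = re.sub(new, '', text, flags=re.IGNORECASE)
print(repr(new_result))  # 'not leak not secured not fitted '
```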
