python pandas.Series.str.contains WHOLE WORD [duplicate]

Posted by esyap4oy on 2023-09-29 in Python

This question already has answers here:

How to match a whole word with a regular expression? (4 answers)
Closed 4 months ago.

df (a pandas DataFrame) has three rows.

col_name
"This is Donald."
"His hands are so small"
"Why are his fingers so short?"

I want to extract the rows that contain "is" and "small".
If I do

df.col_name.str.contains("is|small", case=False)

then it also catches "His", which I don't want.
Is the query below the right way to match whole words in a df.Series?

df.col_name.str.contains("\bis\b|\bsmall\b", case=False)

rsl1atfo1#

No, the regex /bis/b|/bsmall/b will fail because you are using /b rather than \b, which means "word boundary".
Change that and it will match. I would suggest using

\b(is|small)\b

This regex is a little faster and a little clearer, at least to me. Remember to put it in a raw string (r"\b(is|small)\b") so that you don't have to escape the backslashes.
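As a minimal sketch of the suggestion above, the snippet below rebuilds the three-row frame from the question and applies the grouped pattern as a raw string. The `(?:...)` non-capturing group is my substitution, not part of the answer: `str.contains` emits a UserWarning about match groups when the pattern has a capturing group, and the non-capturing form avoids it without changing which rows match.

```python
import pandas as pd

# The three example rows from the question.
df = pd.DataFrame({"col_name": [
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
]})

# Raw string so that \b reaches the regex engine as a word boundary.
# (?:...) groups the alternatives without creating a capture group.
pattern = r"\b(?:is|small)\b"

mask = df["col_name"].str.contains(pattern, case=False)
print(mask.tolist())  # [True, True, False]
print(df.loc[mask, "col_name"].tolist())
```

The third row is excluded because "his" only contains "is" as a substring, not as a whole word.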


6jjcrrmo2#

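First, you may want to convert everything to lowercase, remove punctuation and surrounding whitespace, and then turn the result into a set of words.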

import string

df['words'] = [set(words) for words in
    df['col_name']
    .str.lower()
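    # regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default.
    .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)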
    .str.strip()
    .str.split()
]

>>> df
                        col_name                                words
0                This is Donald.                   {this, is, donald}
1         His hands are so small         {small, his, so, are, hands}
2  Why are his fingers so short?  {short, fingers, his, so, are, why}

You can now use boolean indexing to check whether all of the target words are in these new word sets.

target_words = ['is', 'small']
# Convert target words to lower case just to be safe.
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))

print(df)
# Output: 
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}  False
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False    

target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))

print(df)
# Output: 
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}   True
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

To extract the matching rows:

>>> df.loc[df.match, 'col_name']
# Output:
# 1    His hands are so small
# Name: col_name, dtype: object

To do all of this in a single statement using boolean indexing:

df.loc[[all(target_word in word_set for target_word in target_words) 
        for word_set in (set(words) for words in
                         df['col_name']
                         .str.lower()
                         # regex=True: pandas >= 2.0 defaults to literal replacement.
                         .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
                         .str.strip()
                         .str.split())], :]

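13z8s7eq3#

In "\bis\b|\bsmall\b", the sequence \b is interpreted as the ASCII backspace character before the string is ever passed to the regex method for matching/searching. For more information, see this document about escape characters. That document states:
When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.
So there are two options:
1. Use the r prefix: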

df.col_name.str.contains(r"\bis\b|\bsmall\b", case=False)

2. (Or) escape the \ character:

df.col_name.str.contains("\\bis\\b|\\bsmall\\b", case=False)
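The difference between the three spellings can be checked directly in plain Python. This is a small illustrative sketch (the variable names are mine) showing that the non-raw literal contains backspace characters, while the raw and double-backslash forms are the same string.

```python
import re

plain = "\bis\b"      # normal literal: each "\b" becomes one backspace character (0x08)
raw = r"\bis\b"       # raw literal: backslash and 'b' are both kept
escaped = "\\bis\\b"  # escaped backslashes: same string as the raw literal

print(len(plain), len(raw))    # 4 6
print(plain == "\x08is\x08")   # True
print(raw == escaped)          # True

text = "This is Donald."
# The plain pattern looks for literal backspace characters, so it finds nothing.
print(bool(re.search(plain, text)))  # False
# The raw pattern uses \b as a word boundary and finds the standalone "is".
print(bool(re.search(raw, text)))    # True
```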

vulvrdjw4#

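Your approach didn't work for me. I don't know why you can't use the logical operator and (&), since I think that is what you really want.
This is a silly way to do it, but it works: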

mask = lambda x: ("is" in x) & ("small" in x)
series_name.apply(mask)
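Note that `in` on strings is a substring test, so the lambda above also counts the "is" inside "His". If whole words are required, one possible variant (a sketch that assumes the question's three rows, with illustrative variable names) is to build one word-boundary mask per word and combine them with `&`:

```python
import pandas as pd

series = pd.Series([
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
])

# One whole-word mask per target word, each using a raw-string pattern.
has_is = series.str.contains(r"\bis\b", case=False)
has_small = series.str.contains(r"\bsmall\b", case=False)

both = has_is & has_small    # rows containing both whole words
either = has_is | has_small  # rows containing at least one whole word
print(both.tolist())    # [False, False, False]
print(either.tolist())  # [True, True, False]

# For comparison, the substring-based check accepts "His" as containing "is".
substring_both = series.apply(lambda x: ("is" in x) and ("small" in x))
print(substring_both.tolist())  # [False, True, False]
```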

svujldwt5#

As an extension of this discussion, I would like to use a variable inside the regex, like this:

df = df_w[df_w['Country/Region'].str.match("\b(location.loc[i]['country'])\b",case=False)]

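If I leave out the \b ... \b, the code returns all the rows containing Sudan and South Sudan. But when I use "\b(location.loc[i]['country'])\b", it returns an empty DataFrame. Please tell me the correct usage.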
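Two things go wrong in the attempt above: the expression location.loc[i]['country'] sits inside the quotes, so it is matched as literal text rather than evaluated, and "\b" in a non-raw string is a backspace. A possible fix, sketched below with a made-up `countries` Series and a `country` variable standing in for location.loc[i]['country'], is to interpolate the escaped value into a raw f-string. Because "Sudan" is also a whole word inside "South Sudan", word boundaries alone do not separate the two; `str.fullmatch` is shown as one way to require the entire cell to match.

```python
import re
import pandas as pd

# Hypothetical data standing in for df_w['Country/Region'].
countries = pd.Series(["Sudan", "South Sudan", "sudan", "Sudanese Republic"])
country = "Sudan"  # hypothetical stand-in for location.loc[i]['country']

# Raw f-string: \b stays a word boundary and {...} is still interpolated.
# re.escape protects any regex metacharacters in the variable's value.
pattern = rf"\b{re.escape(country)}\b"

whole_word = countries.str.contains(pattern, case=False)
print(whole_word.tolist())  # [True, True, True, False]

# fullmatch requires the whole cell to equal the country name.
exact = countries.str.fullmatch(re.escape(country), case=False)
print(exact.tolist())  # [True, False, True, False]
```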
