Not sure whether this is a "filter with pandas" question or a text analysis question, but:
Given a df,
import pandas as pd

d = {
    "item": ["a", "b", "c", "d"],
    "report": [
        "john rode the subway through new york",
        "sally says she no longer wanted any fish, but",
        "was not submitted",
        "the doctor proceeded to call washington and new york",
    ],
}
df = pd.DataFrame(data=d)
df
results in
item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
c, "was not submitted"
d, "the doctor proceeded to call washington and new york"
and a list of terms to match:
terms = ["new york", "fish"]
Based on whether any substring from terms is found in the report column, how can df be reduced to the following rows, keeping item?
item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
d, "the doctor proceeded to call washington and new york"
4 answers
oogrdqng1#
Another possible solution, based on numpy:
Output:
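A minimal sketch of what a numpy-based filter could look like, assuming the df and terms from the question. This is not necessarily the code the answer used; it relies on np.char.find broadcasting each report against each term, which returns -1 where a term is absent.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "item": ["a", "b", "c", "d"],
        "report": [
            "john rode the subway through new york",
            "sally says she no longer wanted any fish, but",
            "was not submitted",
            "the doctor proceeded to call washington and new york",
        ],
    }
)
terms = ["new york", "fish"]

# Column vector of reports against a row vector of terms:
# the result has shape (n_rows, n_terms).
reports = df["report"].to_numpy(dtype=str)[:, None]
needles = np.array(terms)[None, :]

# np.char.find returns the match position, or -1 when the term is absent.
hits = np.char.find(reports, needles) >= 0

# Keep a row if at least one term was found in its report.
result = df[hits.any(axis=1)]
print(result)
```

This does plain substring matching, so "fish" would also match inside a longer word such as "fishy".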
s6fujrry2#
Taken from another answer here:
You can turn terms into a single string that works as a regex (i.e. with the terms separated by |) and then use Series.str.contains.
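The joined-pattern idea might look like the following sketch, assuming the df and terms from the question. The re.escape call is an addition not mentioned in the answer; it only matters if a term contains regex metacharacters.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "item": ["a", "b", "c", "d"],
        "report": [
            "john rode the subway through new york",
            "sally says she no longer wanted any fish, but",
            "was not submitted",
            "the doctor proceeded to call washington and new york",
        ],
    }
)
terms = ["new york", "fish"]

# Join the terms with "|" so the regex matches any one of them.
# re.escape guards against terms containing characters like "." or "+".
pattern = "|".join(re.escape(term) for term in terms)

# str.contains returns a boolean Series, used here as a row mask.
mask = df["report"].str.contains(pattern, regex=True)
result = df[mask]
print(result)
```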
fumotvh33#
Try this:
Using word boundaries in the regex ensures that "fish" matches but "fishy" does not (as an example).
Output:
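A minimal sketch of the word-boundary idea, assuming the df and terms from the question. Row "e" is a made-up addition, included only to show that "fishy" is rejected.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "item": ["a", "b", "c", "d", "e"],
        "report": [
            "john rode the subway through new york",
            "sally says she no longer wanted any fish, but",
            "was not submitted",
            "the doctor proceeded to call washington and new york",
            "the milk smelled fishy",  # hypothetical row, not in the question
        ],
    }
)
terms = ["new york", "fish"]

# \b anchors the alternation to word boundaries; the non-capturing
# group (?:...) keeps the boundaries applied to every term.
alternatives = "|".join(re.escape(term) for term in terms)
pattern = rf"\b(?:{alternatives})\b"

result = df[df["report"].str.contains(pattern, regex=True)]
print(result)
```

With a plain substring match row "e" would be kept as well; the boundaries are what exclude it.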
2uluyalo4#
Try this:
Output: