Filtering a Pandas DataFrame based on the presence of substrings in a column

Posted by ckx4rj1h on 2023-02-14 in Other


I'm not sure whether this is a "filtering with Pandas" question or a text-analysis question, but:
Given a df,

import pandas as pd

d = {
    "item": ["a", "b", "c", "d"],
    "report": [
        "john rode the subway through new york",
        "sally says she no longer wanted any fish, but",
        "was not submitted",
        "the doctor proceeded to call washington and new york",
    ],
}
df = pd.DataFrame(data=d)
df

which results in

item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
c, "was not submitted"
d, "the doctor proceeded to call washington and new york"

and a list of terms to match:

terms = ["new york", "fish"]

How can I reduce df to the following rows, keeping item, based on whether any of the substrings in terms is found in the report column?

item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
d, "the doctor proceeded to call washington and new york"

pandas

Source: https://stackoverflow.com/questions/75390017/filtering-a-pandas-dataframe-based-presence-of-substrings-in-column

4 Answers

oogrdqng1#

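Another possible solution, based on numpy:

import numpy as np

strings = np.array(df['report'], dtype=str)
substrings = np.array(terms)

# Broadcast every report against every term; find returns -1 when a term is absent
index = np.char.find(strings[:, None], substrings)
# Keep the rows in which at least one term was found
mask = (index >= 0).any(axis=1)

df.loc[mask]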

Output:

item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
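
To make the broadcasting step in this answer easier to follow, here is a minimal sketch on a two-row subset of the question's data. The reduced `strings` array is an assumption made for illustration; the calls are the same `np.char.find` and `any(axis=1)` used above.

```python
import numpy as np

# Two of the reports from the question, kept short for illustration
strings = np.array(
    ["john rode the subway through new york", "was not submitted"], dtype=str
)
substrings = np.array(["new york", "fish"])

# strings[:, None] has shape (2, 1); broadcasting it against substrings
# (shape (2,)) yields one entry per (report, term) pair, shape (2, 2)
index = np.char.find(strings[:, None], substrings)

# index[i, j] is the position of term j in report i, or -1 if it is absent
mask = (index >= 0).any(axis=1)
print(index.shape, mask)
```

Each row of `index` corresponds to one report and each column to one term, so `any(axis=1)` collapses the terms and leaves one boolean per report.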

2023-02-14

s6fujrry2#

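Adapted from another answer:
You can turn terms into a single string usable as a regex (i.e. joined with |) and then use pd.Series.str.contains. This assumes the terms contain no regex metacharacters.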

term_str = '|'.join(terms) # makes a string of 'new york|fish'
df[df['report'].str.contains(term_str)]
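
If the terms may contain regex metacharacters such as `$`, `(` or `.`, joining them with `|` as-is would change what the pattern means. A minimal sketch of one way to guard against that, assuming hypothetical sample data and terms that are not part of the original question: escape each term with `re.escape` before joining. The `case=False` and `na=False` arguments are optional extras shown here to ignore case and treat missing reports as non-matches.

```python
import re

import pandas as pd

# Hypothetical data: one report contains regex metacharacters, one is missing
df = pd.DataFrame(
    {
        "item": ["a", "b", "c"],
        "report": ["price is $5 (approx.)", "nothing here", None],
    }
)
terms = ["$5", "(approx.)"]  # hypothetical terms, chosen to need escaping

# Escape each term so it is matched literally, then join with the regex OR
pattern = "|".join(re.escape(t) for t in terms)

# na=False makes missing reports count as "no match" instead of NaN
mask = df["report"].str.contains(pattern, case=False, na=False)
result = df[mask]
print(result)
```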

2023-02-14

fumotvh33#

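Try this:
Using word boundaries in the regex ensures that "fish" matches but "fishy" does not (as an example).

# (?:...) is a non-capturing group; a capturing group makes pandas warn about match groups
m = df['report'].str.contains(r'\b(?:{})\b'.format('|'.join(terms)))

df2 = df.loc[m]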

Output:

item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
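
On the question's data, the word-boundary pattern and a plain joined pattern select the same rows. The difference only appears when a term occurs inside a longer word. A small sketch with hypothetical reports (not from the original question) that compares the two; the non-capturing group `(?:...)` is a choice made here to avoid the pandas warning about match groups.

```python
import pandas as pd

# Hypothetical reports: the last two only contain the terms inside longer words
reports = pd.Series(["any fish, but", "a fishy story", "new yorker cartoons"])
terms = ["new york", "fish"]

# Plain alternation: matches a term anywhere, even inside a longer word
plain = reports.str.contains("|".join(terms))

# Word boundaries on both sides: the term must stand as a whole word or phrase
bounded = reports.str.contains(r"\b(?:{})\b".format("|".join(terms)))

print(pd.DataFrame({"report": reports, "plain": plain, "bounded": bounded}))
```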

2023-02-14

2uluyalo4#

Try this:

df[df['report'].apply(lambda x: any(term in x for term in terms))]

Output:

item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
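
The `apply` approach uses Python's `in` operator, so it does plain substring matching with no regex involved. One caveat is that `term in x` raises a `TypeError` when `x` is a missing value rather than a string. A minimal sketch of a guarded variant, assuming hypothetical data with a missing report; `contains_any` is a hypothetical helper name, not something from the original answer.

```python
import pandas as pd

# Hypothetical data with a missing report
df = pd.DataFrame(
    {
        "item": ["a", "b", "c"],
        "report": ["new york trip", None, "was not submitted"],
    }
)
terms = ["new york", "fish"]


def contains_any(text, terms):
    """Return True if text is a string containing at least one of the terms."""
    # Non-string values (None, NaN) are treated as "no match"
    return isinstance(text, str) and any(term in text for term in terms)


mask = df["report"].apply(lambda x: contains_any(x, terms))
result = df[mask]
print(result)
```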

2023-02-14
