Python NLP processing: if statement with "not in" against a stop-word list

guicsvcw  posted on 2023-03-13  in  Python

I am using the spaCy NLP library, and I wrote a function that returns a list of the tokens in a text.

import spacy    
def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)

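This function is incorrect because the stop-word removal does not work. Everything runs fine only when I remove the last condition, and not in stop_words.
How can I change this function so that it removes stop words based on the defined list while keeping all the other conditions?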

v8wbuo2f1#

Your code looks fine; it needs one small change.
At the end of the elif, write and str(word) not in stop_words.

import spacy    
def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    print(doc)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        # The fix: "not in" now has a left-hand operand, str(word).
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and str(word) not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
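
One caveat worth noting: `str(word)` yields the token's original text, so this membership test is case-sensitive. The sketch below uses plain strings in place of spaCy tokens (an assumption, so that it runs without a model) to compare checking the raw text against checking the lowercased form, which is what `word.lower_` provides.

```python
stop_words = ["a", "the", "is", "are"]

# A plain string stands in for a spaCy token here (illustrative assumption).
raw_text = "The"

# Case-sensitive check: "The" is not literally in the list, so it would be kept.
kept_with_raw_text = raw_text not in stop_words

# Lowercasing first (the string equivalent of token.lower_) filters it out.
kept_with_lowercase = raw_text.lower() not in stop_words

print(kept_with_raw_text, kept_with_lowercase)
```

If capitalized stop words such as "The" should also be dropped, comparing the lowercased form is probably the safer choice.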

iyr7buue2#

Your condition is written incorrectly. Your last elif is equivalent to:

# Invalid Python: "not in" has no left-hand operand
condC = not in stop_words
elif condA and condB and condC:
    ...

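If you try to run this code, you get a SyntaxError. To check whether an element is in an iterable, you must put that element on the left-hand side of the in keyword. Here you only need to supply the word: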

elif condA and condB and ... and str(word) not in stop_words:
   ...
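
To make the point concrete, here is a small sketch. It uses the built-in `compile()` to confirm that the operand-less form is rejected by the parser, then shows the valid form. The sample expression and the variable names are illustrative and not taken from the question.

```python
stop_words = ["a", "the", "is", "are"]

# The broken form: `not in` has nothing on its left-hand side.
broken_source = "keep = True and not in stop_words"
try:
    compile(broken_source, "<example>", "exec")
    is_valid_syntax = True
except SyntaxError:
    is_valid_syntax = False

# The valid form: the element being tested sits to the left of `not in`.
word = "the"
keep = word not in stop_words

print(is_valid_syntax, keep)
```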

rkkpypqq3#

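You need to pass stop_words into the function, so that it takes the stop-word list as an input. Then modify the condition that adds words to the token list so that it checks whether the word is in the stop_words list.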

import spacy

def preprocess_text_spacy(text, stop_words):
    nlp = spacy.load('en_core_web_sm')
    tokens = []
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.append(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.append(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and word.lower_ not in stop_words:
            tokens.append(word.lower_)
    return tokens

Example:

text = "This is a sample text to demonstrate the function."
stop_words = ["a", "the", "is", "are"]
tokens = preprocess_text_spacy(text, stop_words)
print(tokens)

Output:

['this', 'sample', 'text', 'to', 'demonstrate', 'function']
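
Note that this version returns a list, so tokens keep their order but a repeated word appears more than once, whereas the question's version collected tokens in a set. If both order and uniqueness matter, one option is to deduplicate afterwards. A minimal sketch, where `unique_in_order` is a hypothetical helper name and the input list is made up for illustration:

```python
def unique_in_order(tokens):
    """Drop repeated tokens while keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so its keys are the
    # tokens in first-seen order.
    return list(dict.fromkeys(tokens))


tokens = ["sample", "text", "sample", "function", "text"]
print(unique_in_order(tokens))
```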
