Python: counting words in a sentence while controlling for negation

tyky79it · posted 2023-03-21 in Python
Follow (0) | Answers (1) | Views (219)

I am trying to count how many times certain words appear in a sentence, while controlling for negation. In the example below I wrote a very basic piece of code that counts how many times "w" appears in "txt". However, I cannot control for negations such as "don't" and/or "not".

txt = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."
w = ["hello", "apple"]

for word in w:
    print(txt.count(word))

The code should find "apple" only 2 times instead of 4, because two of the occurrences are negated. So I would like to add: if there is a negation among the n words before or after an occurrence of "w", do not count it; otherwise, count it.
Note: the negations here are "don't" and "not".
Can anyone help me?
Thank you very much for your help!


r3i60tvu #1

First, before you even think about negation, str.count may not be doing what you expect.

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

text.count('apple') # Outputs: 4

But if you do this:

text = "The thief grappled the pineapples and ran away with a basket of apples"

text.count('apple') # Outputs: 3
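As an aside, if all you need is whole-word matching without pulling in nltk, a word-boundary regex is a quick sketch (standard-library re only):

```python
import re

text = "The thief grappled the pineapples and ran away with a basket of apples"

# \b anchors the match at word boundaries, so the "apple" inside
# "grappled" and "pineapples" no longer matches
print(len(re.findall(r"\bapple\b", text)))   # 0
print(len(re.findall(r"\bapples\b", text)))  # 1
```

Regex boundaries fix the substring problem, but for the negation logic you will still want proper tokenization.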

If you want to count words, you first need to do some tokenization to turn the string into a list of strings, e.g.

from collections import Counter

import nltk
from nltk import word_tokenize

nltk.download('punkt')

text = "The thief grappled the pineapples and ran away with a basket of apples"

Counter(word_tokenize(text))['apple'] # Output: 0
Counter(word_tokenize(text))['apples'] # Output: 1

Then you need to ask yourself: when you count occurrences of apple/apples, does the plural matter? If so, you will have to do some stemming or lemmatization (see Stemmers vs Lemmatizers).
This tutorial may help: https://www.kaggle.com/code/alvations/basic-nlp-with-nltk
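To get a feel for what lemmatization buys you here, a deliberately naive sketch with a made-up `naive_lemma` helper that just strips a trailing plural "s" (a real pipeline would use something like nltk's WordNetLemmatizer instead):

```python
def naive_lemma(token):
    # Toy plural stripping -- a stand-in for a real lemmatizer,
    # good enough only to illustrate the idea
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

tokens = "The thief grappled the pineapples and ran away with a basket of apples".lower().split()
lemmas = [naive_lemma(t) for t in tokens]
print(lemmas.count("apple"))  # 1: "apples" maps to "apple", "pineapples" to "pineapple"
```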
Assuming you adopt a lemmatizer and a tokenizer, and account for whatever else you need to define what a "word" is and how to count them, you then have to define what counts as a negation, and what you ultimately want to do with the counts.
Let's go with:
"I want to break the text into 'chunks' or clauses that carry positive or negative sentiment toward some object/noun."
Then you have to define, in the simplest terms, what negative/positive means:
"Any negation word within a window around the focus noun we treat as 'negative'; in any other case, positive."
If we try to quantify negation in the simplest possible way, you first have to:

  • identify a focus word, let's take the words apple/apples, and
  • then a window, say 5 words before and 5 words after.

Code:

import nltk
from nltk import word_tokenize, ngrams

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

NEGATIVE_WORDS = ["don't", "do not", "not"]
# word_tokenize splits "don't" into ["do", "n't"], so also add the
# contracted negation token that tokenization produces
NEGATIVE_WORDS += ["n't"]

def count_negation(tokens):
    return sum(1 for word in tokens if word in NEGATIVE_WORDS)

for window in ngrams(word_tokenize(text), 5):
    if "apple" in window or "apples" in window:
        print(count_negation(window), window)

[out]:

0 ('I', 'love', 'apples', ',', 'apple')
0 ('love', 'apples', ',', 'apple', 'are')
0 ('apples', ',', 'apple', 'are', 'my')
0 (',', 'apple', 'are', 'my', 'favorite')
0 ('apple', 'are', 'my', 'favorite', 'fruit')
1 ('do', "n't", 'really', 'like', 'apples')
1 ("n't", 'really', 'like', 'apples', 'if')
0 ('really', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'too')
1 ('I', 'do', 'not', 'like', 'apples')
1 ('do', 'not', 'like', 'apples', 'if')
1 ('not', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'immature')
Q: But isn't that over-counting? "I do not like apples" is counted 3 times even though the sentence/clause appears only once in the text.

Yes, it is over-counting, so it comes back to the question of what the end goal of counting negations is.
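One way to avoid the over-counting is to take one window per occurrence of the focus word instead of every sliding n-gram. A minimal sketch, with a hypothetical `count_word` helper and a plain regex tokenizer standing in for nltk:

```python
import re

NEGATION_TOKENS = {"not", "n't", "don't"}

def count_word(text, focus, window=5):
    # One window per mention, so each "apple"/"apples" is
    # classified exactly once as plain or negated
    tokens = re.findall(r"[\w']+", text.lower())
    plain = negated = 0
    for i, tok in enumerate(tokens):
        if tok in (focus, focus + "s"):
            context = tokens[max(0, i - window):i + window + 1]
            if NEGATION_TOKENS & set(context):
                negated += 1
            else:
                plain += 1
    return plain, negated

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."
print(count_word(text, "apple"))  # (2, 2): two plain mentions, two negated
```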
If the end goal is to have a sentiment classifier, then I would argue that a lexical approach is probably inferior to state-of-the-art language models, such as:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

prompt=f"""Do I like apples or not?
QUERY:{text}
OPTIONS:
 - Yes, I like apples
 - No, I hate apples
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True)

[out]:

Yes, I like apples
Q: But what if I want to explain why the model assumed positive/negative sentiment toward apples? How do I explain that without counting negations?

A: Good point. Explaining model outputs is an active research area, so for sure there is no clear answer yet, but take a look at https://aclanthology.org/2022.coling-1.406
