Python: counting words in a sentence while controlling for negation

tyky79it · posted 2023-03-21 in Python
Follow (0) | Answers (1) | Views (219)

I am trying to count how many times certain words appear in a sentence, while controlling for negation. In the example below I wrote a very basic piece of code that counts how many times "w" appears in "txt". However, I cannot control for negations such as "don't" and/or "not".

txt = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."
w = ["hello", "apple"]

for word in w:
    print(txt.count(word))

The code should find "apple" only 2 times instead of 4, because two of the occurrences are negated. So I would like to add: if there is a negation among the n words before or after an occurrence of "w", do not count it; otherwise, count it.
Note: the negations here are "don't" and "not".
Can anyone help me?
Thank you very much for your help!


r3i60tvu #1

First, before you even think about negation, str.count may not be doing what you expect.

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

text.count('apple') # Outputs: 4

But if you do this:

text = "The thief grappled the pineapples and ran away with a basket of apples"

text.count('apple') # Outputs: 3
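As an aside, if all you need is whole-word matching without pulling in nltk, a word-boundary regex is a quick sketch (standard-library re only):

```python
import re

text = "The thief grappled the pineapples and ran away with a basket of apples"

# \b anchors the match at word boundaries, so the "apple" inside
# "grappled" and "pineapples" no longer matches
print(len(re.findall(r"\bapple\b", text)))   # 0
print(len(re.findall(r"\bapples\b", text)))  # 1
```

Regex boundaries fix the substring problem, but for the negation logic you will still want proper tokenization.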

If you want to count words, you first need to do some tokenization to turn the string into a list of strings, e.g.

from collections import Counter

import nltk
from nltk import word_tokenize

nltk.download('punkt')

text = "The thief grappled the pineapples and ran away with a basket of apples"

Counter(word_tokenize(text))['apple'] # Output: 0
Counter(word_tokenize(text))['apples'] # Output: 1

Then you need to ask yourself: when you count occurrences of apple/apples, does the plural matter? If so, you will have to do some stemming or lemmatization (see Stemmers vs Lemmatizers).
This tutorial may help: https://www.kaggle.com/code/alvations/basic-nlp-with-nltk
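To get a feel for what lemmatization buys you here, a deliberately naive sketch with a made-up `naive_lemma` helper that just strips a trailing plural "s" (a real pipeline would use something like nltk's WordNetLemmatizer instead):

```python
def naive_lemma(token):
    # Toy plural stripping -- a stand-in for a real lemmatizer,
    # good enough only to illustrate the idea
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

tokens = "The thief grappled the pineapples and ran away with a basket of apples".lower().split()
lemmas = [naive_lemma(t) for t in tokens]
print(lemmas.count("apple"))  # 1: "apples" maps to "apple", "pineapples" to "pineapple"
```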
Assuming you adopt a lemmatizer and a tokenizer, and account for whatever else you need to define what a "word" is and how to count them, you then have to define what counts as a negation, and what you ultimately want to do with the counts.
Let's go with:
"I want to break the text into 'chunks' or clauses that carry positive or negative sentiment toward some object/noun."
Then you have to define, in the simplest terms, what negative/positive means:
"Any negation word within a window around the focus noun we treat as 'negative'; in any other case, positive."
If we try to quantify negation in the simplest possible way, you first have to:

  • identify a focus word, let's take the words apple/apples, and
  • then a window, say 5 words before and 5 words after.

Code:

import nltk
from nltk import word_tokenize, ngrams

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

NEGATIVE_WORDS = ["don't", "do not", "not"]
# word_tokenize splits "don't" into ["do", "n't"], so also add the
# contracted negation token that tokenization produces
NEGATIVE_WORDS += ["n't"]

def count_negation(tokens):
    return sum(1 for word in tokens if word in NEGATIVE_WORDS)

for window in ngrams(word_tokenize(text), 5):
    if "apple" in window or "apples" in window:
        print(count_negation(window), window)

[out]:

0 ('I', 'love', 'apples', ',', 'apple')
0 ('love', 'apples', ',', 'apple', 'are')
0 ('apples', ',', 'apple', 'are', 'my')
0 (',', 'apple', 'are', 'my', 'favorite')
0 ('apple', 'are', 'my', 'favorite', 'fruit')
1 ('do', "n't", 'really', 'like', 'apples')
1 ("n't", 'really', 'like', 'apples', 'if')
0 ('really', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'too')
1 ('I', 'do', 'not', 'like', 'apples')
1 ('do', 'not', 'like', 'apples', 'if')
1 ('not', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'immature')
Q: But isn't that over-counting? "I do not like apples" is counted 3 times even though the sentence/clause appears only once in the text.

Yes, it is over-counting, so it comes back to the question of what the end goal of counting negations is.
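One way to avoid the over-counting is to take one window per occurrence of the focus word instead of every sliding n-gram. A minimal sketch, with a hypothetical `count_word` helper and a plain regex tokenizer standing in for nltk:

```python
import re

NEGATION_TOKENS = {"not", "n't", "don't"}

def count_word(text, focus, window=5):
    # One window per mention, so each "apple"/"apples" is
    # classified exactly once as plain or negated
    tokens = re.findall(r"[\w']+", text.lower())
    plain = negated = 0
    for i, tok in enumerate(tokens):
        if tok in (focus, focus + "s"):
            context = tokens[max(0, i - window):i + window + 1]
            if NEGATION_TOKENS & set(context):
                negated += 1
            else:
                plain += 1
    return plain, negated

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."
print(count_word(text, "apple"))  # (2, 2): two plain mentions, two negated
```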
If the end goal is to have a sentiment classifier, then I would argue that a lexical approach is probably inferior to state-of-the-art language models, such as:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

prompt=f"""Do I like apples or not?
QUERY:{text}
OPTIONS:
 - Yes, I like apples
 - No, I hate apples
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True)

[out]:

Yes, I like apples
Q: But what if I want to explain why the model assumed positive/negative sentiment toward apples? How do I explain that without counting negations?

A: Good point. Explaining model outputs is an active research area, so for sure there is no clear answer yet, but take a look at https://aclanthology.org/2022.coling-1.406
