python - Noun phrases with spaCy

doinxwow · posted 2023-01-19 in Python
Follow (0) | Answers (5) | Views (123)

How can I extract noun phrases from text using spaCy?
I am not referring to part-of-speech tags. In the documentation I cannot find anything about noun phrases or regular parse trees.


polhcujo1#

If you want base NPs, i.e. NPs without coordination, prepositional phrases or relative clauses, you can use the noun_chunks iterator on the Doc and Span objects:

>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(u'The cat and the dog sleep in the basket near the door.')
>>> for np in doc.noun_chunks:
...     print(np.text)
The cat
the dog
the basket
the door

If you need something else, the best approach is to iterate over the words of the sentence and consider the syntactic context to determine whether the word governs the kind of phrase you want. If it does, yield its subtree:

from spacy.symbols import *

np_labels = set([nsubj, nsubjpass, dobj, iobj, pobj]) # Probably others too
def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            yield word.subtree
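As a usage sketch (not part of the original answer, and assuming a trained pipeline such as en_core_web_sm is installed so that a parser is available), the yielded subtrees can be turned back into phrase strings like this:

import spacy
from spacy.symbols import nsubj, nsubjpass, dobj, iobj, pobj

nlp = spacy.load("en_core_web_sm")  # assumed model; any pipeline with a parser works
np_labels = {nsubj, nsubjpass, dobj, iobj, pobj}

def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            yield word.subtree

doc = nlp("The cat and the dog sleep in the basket near the door.")
for subtree in iter_nps(doc):
    # each subtree is an iterator over Tokens in document order
    print(" ".join(token.text for token in subtree))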

xmjla07d2#

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('Bananas are an excellent source of potassium.')
for np in doc.noun_chunks:
    print(np.text)
'''
  Bananas
  an excellent source
  potassium
'''

for word in doc:
    print('word.dep:', word.dep, ' | ', 'word.dep_:', word.dep_)
'''
  word.dep: 429  |  word.dep_: nsubj
  word.dep: 8206900633647566924  |  word.dep_: ROOT
  word.dep: 415  |  word.dep_: det
  word.dep: 402  |  word.dep_: amod
  word.dep: 404  |  word.dep_: attr
  word.dep: 443  |  word.dep_: prep
  word.dep: 439  |  word.dep_: pobj
  word.dep: 445  |  word.dep_: punct
'''

from spacy.symbols import *
np_labels = set([nsubj, nsubjpass, dobj, iobj, pobj])
print('np_labels:', np_labels)
'''
  np_labels: {416, 422, 429, 430, 439}
'''

https://www.geeksforgeeks.org/use-yield-keyword-instead-return-keyword-python/

def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            yield word.dep_

iter_nps(doc)
'''
  <generator object iter_nps at 0x7fd7b08b5bd0>
'''

## Modified method:
def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            print(word.text, word.dep_)

iter_nps(doc)
'''
  Bananas nsubj
  potassium pobj
'''

doc = nlp('BRCA1 is a tumor suppressor protein that functions to maintain genomic stability.')
for np in doc.noun_chunks:
    print(np.text)
'''
  BRCA1
  a tumor suppressor protein
  genomic stability
'''

iter_nps(doc)
'''
  BRCA1 nsubj
  that nsubj
  stability dobj
'''
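Note that the modified method above prints only the head token of each phrase (e.g. stability rather than genomic stability). A small variation, sketched here as an assumption rather than part of the original answer, uses Token.left_edge and Token.right_edge to recover the whole subtree as a Span:

def iter_np_spans(doc):
    # hypothetical helper built on the same np_labels set defined above
    for word in doc:
        if word.dep in np_labels:
            # left_edge/right_edge are the first and last tokens of the word's subtree
            yield doc[word.left_edge.i : word.right_edge.i + 1]

for span in iter_np_spans(doc):
    print(span.text, span.root.dep_)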

b4lqfgs43#

You can also get the nouns from a sentence like this:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("When Sebastian Thrun started working on self-driving cars at "
          "Google in 2007, few people outside of the company took him "
          "seriously. “I can tell you very senior CEOs of major American "
          "car companies would shake my hand and turn away because I wasn’t "
          "worth talking to,” said Thrun, in an interview with Recode earlier "
          "this week.")
# doc text is from the spaCy website
for x in doc:
    if x.pos_ == "NOUN" or x.pos_ == "PROPN" or x.pos_ == "PRON":
        print(x)
# here you get nouns, proper nouns and pronouns

xxhby3vn4#

If you want to be more specific about which kind of noun phrases to extract, you can use textacy's matches function. You can pass it any combination of POS tags. For example,

textacy.extract.matches(doc, "POS:ADP POS:DET:? POS:ADJ:? POS:NOUN:+")

will return any nouns that are preceded by a preposition and, optionally, a determiner and/or adjective.
Textacy is built on top of spaCy, so they should work together perfectly.
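As a rough sketch of how that call could be wired up (assuming a textacy version that still exposes textacy.extract.matches; newer releases renamed it to token_matches, and en_core_web_sm is assumed to be installed):

import spacy
import textacy.extract

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat slept in the basket near the door.")

# preposition, optional determiner, optional adjective, one or more nouns
pattern = "POS:ADP POS:DET:? POS:ADJ:? POS:NOUN:+"
for span in textacy.extract.matches(doc, pattern):
    print(span.text)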


jutyujz05#

from spacy.en import English may give you an error:
No module named 'spacy.en'
All language data has been moved to the submodule spacy.lang in spaCy 2.0+.
Please use from spacy.lang.en import English
and then follow all the remaining steps as @syllogism_ answered.
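Spelled out, a minimal sketch of the updated setup (assuming spaCy 2.0+ and the en_core_web_sm model; note that a bare English() pipeline has no parser, so noun_chunks needs a trained model):

from spacy.lang.en import English  # replaces `from spacy.en import English`; gives a blank pipeline

import spacy

nlp = spacy.load("en_core_web_sm")  # trained pipeline with the parser that noun_chunks requires
doc = nlp("The cat and the dog sleep in the basket near the door.")
for np in doc.noun_chunks:
    print(np.text)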
