Python: How to use exact words in NLTK RegexpParser

olhwl3o2  posted on 2023-02-15  in Python

I want to extract specific phrases from text with the help of NLTK RegexpParser. Is there a way to combine exact words with pos_tags?
For example, this is my text:

import nltk

text = "Samle Text and sample Text and text With University of California and Institute for Technology with SAPLE TEXT"

tokens = nltk.word_tokenize(text)
tagged_text = nltk.pos_tag(tokens)

regex = "ENTITY:{<University|Institute><for|of><NNP|NN>}"

# searching by regex that is defined
entity_search = nltk.RegexpParser(regex)
entity_result = entity_search.parse(tagged_text)
entity_result = list(entity_result)
print(entity_result)

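Of course, I have many different word combinations that I want to use in my "ENTITY" regex, and my texts are much longer. Is there a way to make this work? FYI, I want it to work with RegexpParser; I do not want to use regular regexes.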

dxxyhpgq1#

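Here is a solution that does not require you to specify exact words but still extracts the entities of interest. The pattern {<N.*><IN><N.*>} matches any noun-related tag <N.*>, followed by a preposition or subordinating conjunction tag <IN>, followed by another noun-related tag <N.*>. This is the general PoS tag pattern for strings such as "University of ____" or "Institute for ____". You can change <N.*> to <NNP> to make it stricter and match only proper nouns. For more information on PoS tags, see this tutorial.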

Solution #1

from nltk import word_tokenize, pos_tag, RegexpParser

text = "This is sample text with the University of California and Institute for Technology with more sample text."
tokenized = word_tokenize(text)  # Tokenize text
tagged_text = pos_tag(tokenized)  # Tag tokenized text with PoS tags

# Create custom grammar rule to find occurrences of a noun followed by a preposition or subordinating conjunction, followed by another noun (e.g. University of ___)
my_grammar = r"""
ENTITY: {<N.*><IN><N.*>}"""

# Function to create parse tree using custom grammar rules and PoS tagged text
def get_parse_tree(grammar, pos_tagged_text):
    cp = RegexpParser(grammar)
    parse_tree = cp.parse(pos_tagged_text)
    parse_tree.draw()  # Visualise parse tree
    return parse_tree

# Function to get labels from custom grammar:
# takes line separated NLTK regexp grammar rules
def get_labels_from_grammar(grammar):
    labels = []
    for line in grammar.splitlines()[1:]:
        labels.append(line.split(":")[0])
    return labels

# Function takes parse tree & list of NLTK custom grammar labels as input
# Returns phrases which match
def get_phrases_using_custom_labels(parse_tree, custom_labels_to_get):
    matching_phrases = []
    for node in parse_tree.subtrees(filter=lambda x: any(x.label() == custom_l for custom_l in custom_labels_to_get)):
        # Get phrases only, drop PoS tags
        matching_phrases.append([leaf[0] for leaf in node.leaves()])
    return matching_phrases

text_parse_tree = get_parse_tree(my_grammar, tagged_text)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)

for phrase in phrases:
    print(phrase)

Output

['University', 'of', 'California']
['Institute', 'for', 'Technology']
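
As a rough illustration of the stricter <NNP> variant mentioned above, the sketch below runs both grammars over the same tokens. The tags are typed in by hand (an assumption, so that the example does not depend on pos_tag output or on downloaded tagger models), and the helper name extract_entities is hypothetical.

```python
from nltk import RegexpParser

# Hand-tagged tokens (an assumption for illustration; real tags come from pos_tag)
tagged = [
    ("the", "DT"), ("University", "NNP"), ("of", "IN"), ("California", "NNP"),
    ("and", "CC"), ("sample", "NN"), ("of", "IN"), ("text", "NN"),
]

def extract_entities(grammar, tagged_tokens):
    """Return the words of every ENTITY chunk found by the grammar."""
    tree = RegexpParser(grammar).parse(tagged_tokens)
    return [
        [word for word, _tag in subtree.leaves()]
        for subtree in tree.subtrees(filter=lambda t: t.label() == "ENTITY")
    ]

# Loose rule: any noun tag on both sides of the preposition
loose = extract_entities("ENTITY: {<N.*><IN><N.*>}", tagged)
# Strict rule: proper nouns only
strict = extract_entities("ENTITY: {<NNP><IN><NNP>}", tagged)

print(loose)
print(strict)
```

With these hand-written tags, the loose grammar also picks up "sample of text", while the strict one keeps only "University of California". On real text the result depends on how pos_tag labels each token.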

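If you really need the ability to capture exact words, you can do so by defining a custom tag for each word you need. A simple solution that does not require training your own custom tagger is as follows: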

Solution #2

from nltk import word_tokenize, pos_tag, RegexpParser

text = "This is sample text with the University of California and Institute for Technology with more sample text."
tokenized = word_tokenize(text)  # Tokenize text
tagged_text = pos_tag(tokenized)  # Tag tokenized text with PoS tags

# Define custom tags for specific words
my_specific_tagged_words = {
    "ORG_TYPE": ["University", "Institute"],
    "PREP": ["of", "for"]
}

# Create a real copy of the tagged text to modify with custom tags
modified_tagged_text = list(tagged_text)

# Iterate over the tagged text, find the specified words, and replace their tags
for i, (token, _) in enumerate(tagged_text):
    for tag, words in my_specific_tagged_words.items():
        if token in words:
            modified_tagged_text[i] = (token, tag)  # Override the tag for this specific word

# Create custom grammar rule to find occurrences of an ORG_TYPE tag, followed by a PREP tag, followed by a noun
my_grammar = r"""
ENTITY: {<ORG_TYPE><PREP><N.*>}"""

# Copy previously defined get_parse_tree, get_labels_from_grammar, get_phrases_using_custom_labels functions here...

# Parse the retagged copy, not the original tagged_text
text_parse_tree = get_parse_tree(my_grammar, modified_tagged_text)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)

for phrase in phrases:
    print(phrase)

Output

['University', 'of', 'California']
['Institute', 'for', 'Technology']
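
The retagging step can also be written as a single dictionary lookup per token, which may be easier to maintain when there are many word lists. This is a minimal sketch, assuming the same word lists as above; the names WORD_TO_TAG and retag are hypothetical, and the input tags are hand-written for illustration rather than produced by pos_tag.

```python
# Custom tags and the exact words that should receive them (same lists as above)
my_specific_tagged_words = {
    "ORG_TYPE": ["University", "Institute"],
    "PREP": ["of", "for"],
}

# Invert the mapping once: word -> custom tag
WORD_TO_TAG = {
    word: tag
    for tag, words in my_specific_tagged_words.items()
    for word in words
}

def retag(tagged_tokens):
    """Return a new list in which listed words carry their custom tag."""
    return [(word, WORD_TO_TAG.get(word, tag)) for word, tag in tagged_tokens]

# Hand-written PoS tags (an assumption; real tags come from pos_tag)
tagged = [("Institute", "NNP"), ("for", "IN"), ("Technology", "NNP"), ("with", "IN")]
print(retag(tagged))
```

The returned list can be passed to RegexpParser in place of modified_tagged_text. The lookup is case-sensitive, so "university" or "OF" would keep their original PoS tags unless they are added to the word lists.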
