Python: How to use exact words in NLTK RegexpParser

Posted by olhwl3o2 on 2023-02-15 in Python


I want to extract specific phrases from text with the help of NLTK RegexpParser. Is there a way to combine exact words with pos_tags?
For example, here is my text:

import nltk

text = "Samle Text and sample Text and text With University of California and Institute for Technology with SAPLE TEXT"

tokens = nltk.word_tokenize(text)
tagged_text = nltk.pos_tag(tokens)

regex = "ENTITY:{<University|Institute><for|of><NNP|NN>}"

# searching by regex that is defined
entity_search = nltk.RegexpParser(regex)
entity_result = entity_search.parse(tagged_text)
entity_result = list(entity_result)
print(entity_result)

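Of course, I have many different word combinations that I want to use in my "ENTITY" regex, and my texts are much longer. Is there a way to make this work? By the way, I want it to work with RegexpParser; I do not want to use regular regexes.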

Source: https://stackoverflow.com/questions/74630803/how-to-use-exact-words-in-nltk-regexpparser

1 Answer

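Here is a solution that does not require you to specify exact words but still extracts the entities of interest. The pattern {<N.*><IN><N.*>} matches any noun-related tag <N.*>, followed by a preposition or subordinating conjunction tag <IN>, followed by another noun-related tag <N.*>. This is the general PoS tag pattern for strings such as "University of ____" or "Institute for ____". You can change <N.*> to <NNP> to make it stricter and match only proper nouns. For more information on PoS tags, see this tutorial.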
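For context, tag patterns in a RegexpParser grammar are compared with the PoS tag of each token, not with the word itself, which is why the grammar in the question finds nothing. The sketch below illustrates this on a hand-tagged token list; the tags are written by hand as an assumption (so that no tagger model is needed), and `chunk_phrases` is a hypothetical helper, not part of NLTK.

```python
from nltk import RegexpParser

# Hand-tagged tokens (assumed tags; a real tagger may differ)
tagged = [
    ("University", "NNP"), ("of", "IN"), ("California", "NNP"),
    ("and", "CC"),
    ("Institute", "NNP"), ("for", "IN"), ("Technology", "NNP"),
]

def chunk_phrases(grammar, tagged_tokens, label="ENTITY"):
    """Return the words of every chunk that carries the given label."""
    tree = RegexpParser(grammar).parse(tagged_tokens)
    return [
        [word for word, _tag in subtree.leaves()]
        for subtree in tree.subtrees(lambda t: t.label() == label)
    ]

# Grammar from the question: the brackets hold words, but they are matched against tags
word_grammar = "ENTITY: {<University|Institute><for|of><NNP|NN>}"
# Grammar from the answer: the brackets hold tags
tag_grammar = "ENTITY: {<N.*><IN><N.*>}"

print(chunk_phrases(word_grammar, tagged))  # no token has the tag "University"
print(chunk_phrases(tag_grammar, tagged))
```

The first call finds no chunks because no token is tagged "University", "Institute", "for" or "of"; the second finds both organisation names.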

Solution #1

from nltk import word_tokenize, pos_tag, RegexpParser

text = "This is sample text with the University of California and Institute for Technology with more sample text."
tokenized = word_tokenize(text)  # Tokenize text
tagged_text = pos_tag(tokenized)  # Tag tokenized text with PoS tags

# Create custom grammar rule to find occurrences of a noun followed by a preposition or subordinating conjunction, followed by another noun (e.g. University of ___)
my_grammar = r"""
ENTITY: {<N.*><IN><N.*>}"""

# Function to create parse tree using custom grammar rules and PoS tagged text
def get_parse_tree(grammar, pos_tagged_text):
    cp = RegexpParser(grammar)
    parse_tree = cp.parse(pos_tagged_text)
    parse_tree.draw()  # Visualise parse tree (opens a Tkinter window)
    return parse_tree

# Function to get labels from custom grammar:
# takes line separated NLTK regexp grammar rules
def get_labels_from_grammar(grammar):
    labels = []
    for line in grammar.splitlines()[1:]:
        labels.append(line.split(":")[0])
    return labels

# Function takes parse tree & list of NLTK custom grammar labels as input
# Returns phrases which match
def get_phrases_using_custom_labels(parse_tree, custom_labels_to_get):
    matching_phrases = []
    for node in parse_tree.subtrees(filter=lambda x: x.label() in custom_labels_to_get):
        # Get phrases only, drop PoS tags
        matching_phrases.append([leaf[0] for leaf in node.leaves()])
    return matching_phrases

text_parse_tree = get_parse_tree(my_grammar, tagged_text)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)

for phrase in phrases:
    print(phrase)

Output

['University', 'of', 'California']
['Institute', 'for', 'Technology']

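If you really need the ability to capture exact words, you can do so by defining custom tags for each word you need. A simple solution that does not require training your own custom tagger is as follows: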

Solution #2

from nltk import word_tokenize, pos_tag, RegexpParser

text = "This is sample text with the University of California and Institute for Technology with more sample text."
tokenized = word_tokenize(text)  # Tokenize text
tagged_text = pos_tag(tokenized)  # Tag tokenized text with PoS tags

# Define custom tags for specific words
my_specific_tagged_words = {
    "ORG_TYPE": ["University", "Institute"],
    "PREP": ["of", "for"]
}

# Create copy of tagged text to modify with custom tags
modified_tagged_text = list(tagged_text)

# Iterate over tagged text, find the specified words and then modify the tags
for i, text_tag_tuple in enumerate(tagged_text):
    for tag in my_specific_tagged_words.keys():
        for word in my_specific_tagged_words[tag]:
            if text_tag_tuple[0] == word:
                modified_tagged_text[i] = (word, tag) # Modify tag for specific word

# Create custom grammar rule to find occurrences of ORG_TYPE tag, followed by PREP tag, followed by another noun
my_grammar = r"""
ENTITY: {<ORG_TYPE><PREP><N.*>}"""

# Copy previously defined get_parse_tree, get_labels_from_grammar, get_phrases_using_custom_labels functions here...

# Parse the re-tagged copy, not the original tagged text
text_parse_tree = get_parse_tree(my_grammar, modified_tagged_text)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)

for phrase in phrases:
    print(phrase)

Output

['University', 'of', 'California']
['Institute', 'for', 'Technology']
