Python named entity recognition with entities split by other words

Posted by qv7cva1a on 2023-03-06 in Python


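I have a legal document in which I want to automatically identify references to other legal documents.
These documents are structured like the following dummy example: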

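Article 85 Paragraph 6 of Very Important Treaty No. 12/34/56 explicitly states you should do your research on Another Important Treaty No. 56/78, especially when it comes to Article 1 Letter a and Article 2.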

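My goal is to identify the references to other documents. I started with regex, which worked relatively well. However, as in the example above, I sometimes need context to identify the information. I am therefore considering spaCy's (nested) named entity recognition to solve this.
The problem: as mentioned, I want to identify links to other paragraphs of the same document or to paragraphs of other documents. The example above contains only links to external documents, namely these three:
1. Article 85 Paragraph 6 of Very Important Treaty No. 12/34/56
2. Article 1 Letter a of Another Important Treaty No. 56/78
3. Article 2 of Another Important Treaty No. 56/78
Usually, when labelling data for NER with spaCy, each entity is marked with a start index, an end index, and the entity type. In this case, however, the links I want to extract are split across several parts of the sentence, so I would need something like start:ignore_start + ignore_end:end.
I thought about labelling the entities as article, paragraph, letter and document_title. But then I could not put the information back together.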
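To make the limitation concrete, here is a minimal sketch of the component-level labelling described above, written as plain (start, end, label) character-offset triples. The helper `span` and the label names `ARTICLE`, `PARAGRAPH`, `LETTER` and `DOCUMENT_TITLE` are illustrative assumptions, not part of any library. Each component gets its own contiguous span, but nothing in this flat list records which article belongs to which document title.

```python
text = ("Article 85 Paragraph 6 of Very Important Treaty No. 12/34/56 "
        "explicitly states you should do your research on "
        "Another Important Treaty No. 56/78, especially when it comes to "
        "Article 1 Letter a and Article 2")


def span(text, phrase, label):
    """Return a (start, end, label) character-offset triple for the first
    occurrence of `phrase` (hypothetical helper, for illustration only)."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Flat, contiguous spans: one per component of a reference.
annotations = [
    span(text, "Article 85", "ARTICLE"),
    span(text, "Paragraph 6", "PARAGRAPH"),
    span(text, "Very Important Treaty No. 12/34/56", "DOCUMENT_TITLE"),
    span(text, "Another Important Treaty No. 56/78", "DOCUMENT_TITLE"),
    span(text, "Article 1", "ARTICLE"),
    span(text, "Letter a", "LETTER"),
    span(text, "Article 2", "ARTICLE"),
]

for start, end, label in annotations:
    print(label, start, end, repr(text[start:end]))
```

Linking "Article 2" back to "Another Important Treaty No. 56/78" would need an extra grouping step on top of these spans, which is exactly the missing piece.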
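I stumbled upon nested named entity recognition, but I am not sure whether it would really help me. How do you think the data must be labelled to solve this problem?
Edit: Regarding the expected output, I need to be able to tell which documents, articles and paragraphs I need to link. The expected output could therefore be, for example, a list of dictionaries: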

[
{
  'article': 85,
  'paragraph': 6,
  'letter': None,
  'document_title': 'Very Important Treaty No. 12/34/56'
},
{
  'article': 1,
  'paragraph': None,
  'letter': 'a',
  'document_title': 'Another Important Treaty No. 56/78'
},
{
  'article': 2,
  'paragraph': None,
  'letter': None,
  'document_title': 'Another Important Treaty No. 56/78'
}
]

I hope this clarifies things.

python

Source: https://stackoverflow.com/questions/75619751/named-entity-recognition-with-entities-split-by-other-words

1 Answer


u0sqgete1#

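It is not clear what exactly you need as a result: a final text or a dictionary/set of variables (and whether Letter is needed in the output). Nevertheless, consider the following: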

import re

def documents_identification(input_str):
    # Every component is optional; a document title may only consist of the
    # words listed in the alternation, followed by "No." and a number.
    regex = r"(Article\s+(?P<article>\d+)(\s*|,|$))?" \
        + r"(Letter\s+(?P<letter>[a-z]+)(\s*|,|$))?" \
        + r"(Paragraph\s+(?P<paragraph>\d+)(\s*|,|$))?" \
        + r"(?P<document_title>(?<=\s)((Another|of|Very|Important|Treaty)\s+)+No\.\s+[\d/]+(?:\s*|,|$))?"
    regex_flags = re.I | re.M | re.S

    # input_str += ' ' # sometimes simpler than "(?:\s*|,|$)" etc. at the end

    article = letter = paragraph = document_title = result = last = ''
    article_paragraph_used = False

    for m in re.finditer(regex, input_str, regex_flags):
        if m.group('article'):
            # A new article starts a new reference, so drop the previous letter.
            article = "Article " + m.group('article') + ' '
            letter = ''
        if m.group('letter'):
            letter = "Letter " + m.group('letter') + ' '
        if m.group('paragraph'):
            paragraph = "Paragraph " + m.group('paragraph') + ' '
        if m.group('document_title'):
            # Components already emitted with the previous title must not
            # carry over to the new one.
            if document_title and article_paragraph_used:
                article = letter = paragraph = ''
            document_title = m.group('document_title')

        new = article + letter + paragraph + document_title
        # Emit only complete references (article + title) that changed.
        if new != last and document_title and article:
            article_paragraph_used = True
            last = new
            result += new + '\n'

    return result

# Named "text" rather than "str" to avoid shadowing the built-in type.
text = 'Article 85 Paragraph 6 of Very Important Treaty No. 12/34/56 explicitly states you should do your research on Another Important Treaty No. 56/78, especially when it comes to Article 1 Letter a and Article 2'

print(documents_identification(text))

Result:

Article 85 Paragraph 6 of Very Important Treaty No. 12/34/56
Article 1 Letter a Another Important Treaty No. 56/78
Article 2 Another Important Treaty No. 56/78
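If the list of dictionaries from the question's edit is the goal, the same idea can be sketched with one token regex and a small amount of state. This is not the code above but a simplified variant under stated assumptions: the names `TOKEN` and `extract_references` are hypothetical, titles consist only of the words in the alternation, and every article attaches to the most recent title seen (or to the next title if none has appeared yet). A sentence such as "Title A ... Article 3 of Title B" would therefore be linked wrongly.

```python
import re

# One alternative per component; only the matched alternative's group is set.
TOKEN = re.compile(
    r"Article\s+(?P<article>\d+)"
    r"|Paragraph\s+(?P<paragraph>\d+)"
    r"|Letter\s+(?P<letter>[a-z]+)\b"
    r"|(?P<document_title>\b(?:(?:Another|Very|Important|Treaty)\s+)+No\.\s+[\d/]+)",
    re.IGNORECASE,
)


def extract_references(text):
    """Return one dict per article reference found in `text` (sketch only)."""
    refs = []
    pending = []    # references seen before any title was available
    title = None    # most recent document title
    current = None  # reference that paragraph/letter tokens attach to
    for m in TOKEN.finditer(text):
        kind = m.lastgroup
        value = m.group(kind)
        if kind == "document_title":
            title = value
            for ref in pending:  # e.g. "Article 85 Paragraph 6 of <title>"
                ref["document_title"] = title
            pending = []
        elif kind == "article":
            current = {"article": int(value), "paragraph": None,
                       "letter": None, "document_title": title}
            refs.append(current)
            if title is None:
                pending.append(current)
        elif current is not None:
            current[kind] = int(value) if kind == "paragraph" else value
    return refs


text = ("Article 85 Paragraph 6 of Very Important Treaty No. 12/34/56 "
        "explicitly states you should do your research on "
        "Another Important Treaty No. 56/78, especially when it comes to "
        "Article 1 Letter a and Article 2")

for ref in extract_references(text):
    print(ref)
```

For the example sentence this yields three dictionaries with the same keys as in the question's expected output. References within the same document (with no title at all) are left with `document_title` set to `None`.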
