Python: merging named entities with spaCy's Matcher module

ee7vknir  posted on 2022-11-21  in  Python
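import csv
from collections import Counter

import pandas as pd
from spacy import displacy
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumed to be defined earlier in the script (not shown in the post):
# nlp (the loaded spaCy pipeline), matcher (a spacy.matcher.Matcher built
# on nlp.vocab), filepath, cleaned_posts and the preprocessing helpers.

def match_patterns(cleanests_post):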

    mark_rutte = [
    [{"LOWER": "mark", 'OP': '?'}, {"LOWER": "rutte", 'OP': '?'}],

    [{"LOWER": "markie"}]

    ]

    matcher.add("Mark Rutte", mark_rutte, on_match=add_person_ent)

    hugo_dejonge = [
    [{"LOWER": "hugo", 'OP': '?'}, {"LOWER": "de jonge", 'OP': '?'}]

    ]

    matcher.add("Hugo de Jonge", hugo_dejonge, on_match=add_person_ent)


    adolf_hitler = [
    [{"LOWER": "adolf", 'OP': '?'}, {"LOWER": "hitler", 'OP': '?'}]

    ]

    matcher.add("Adolf Hitler", adolf_hitler, on_match=add_person_ent)

    matches = matcher(cleanests_post)
    matches.sort(key = lambda x:x[1])

    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = cleanests_post[start:end]  # The matched span
        # print('matches', match_id, string_id, start, end, span.text)
        # print ('$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$')

    
    return (cleanests_post)


def add_person_ent(matcher, cleanests_post, i, matches):
        
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)

    match_id, start, end = matches[i]
    entity = Span(cleanests_post, start, end, label="PERSON")

    filtered = filter_spans(cleanests_post.ents) # When spans overlap, the (first) longest span is preferred over shorter spans.

    filtered += (entity,)

    cleanests_post = filtered

    return (cleanests_post)

 

with open(filepath, encoding='latin-1') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')

    next(reader, None) # Skip first row (= header) of the csv file

    dict_from_csv = {rows[0]:rows[2] for rows in reader} # creates a dictionary with 'date' as keys and 'text' as values
    #print (dict_from_csv)

    values = dict_from_csv.values()
    values_list = list(values)
    #print ('values_list:', values_list)

    people = []

    for post in values_list: # iterate over each post
       

        # Do some preprocessing here  

        clean_post = remove_images(post)

        cleaner_post = remove_forwards(clean_post)

        cleanest_post = remove_links(cleaner_post)

        cleanests_post = delete_breaks(cleanest_post)

        cleaned_posts.append(cleanests_post)

        cleanests_post = nlp(cleanests_post)

        cleanests_post = match_patterns(cleanests_post) 

        if cleanests_post.ents:
            show_results = displacy.render(cleanests_post, style='ent')
   

        # GET PEOPLE
        
        for named_entity in cleanests_post.ents:
            if named_entity.label_ == "PERSON":
                #print ('NE PERSON:', named_entity)
                people.append(named_entity.text)

    people_tally = Counter(people)

    df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
    print ('people:', df)

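I am using spaCy to extract the named entities mentioned in a series of Telegram groups. My data are csv files with the columns 'date' and 'text' (a string holding the content of each post).
To optimise my output I would like to merge entities such as 'Mark', 'Rutte', 'Mark Rutte' and 'Markie' (and their lowercase forms), because they all refer to the same person. My approach is to use spaCy's built-in Matcher module to merge these entities.
In my code, match_patterns() defines patterns such as mark_rutte, and add_person_ent() is meant to append each pattern match as an entity to doc.ents (in my case cleanests_post.ents).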
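As a point of comparison, here is a minimal sketch of patterns that cover the surface forms listed above. It assumes a blank English pipeline (tokenizer only, no trained NER); PATTERNS and canonical_names are hypothetical names introduced for illustration. Each dict in a pattern describes exactly one token, so with spaCy's default tokenizer a multi-word name such as "de jonge" has to be written as two token dicts rather than one. The match key doubles as the canonical name, which Matcher returns as the span label when called with as_spans=True.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: a blank pipeline, so the sketch runs without a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical table: canonical name -> list of token patterns.
# Every inner dict matches a single token, compared in lower case.
PATTERNS = {
    "Mark Rutte": [
        [{"LOWER": "mark"}, {"LOWER": "rutte"}],
        [{"LOWER": "rutte"}],
        [{"LOWER": "markie"}],
    ],
    "Hugo de Jonge": [
        [{"LOWER": "hugo"}, {"LOWER": "de"}, {"LOWER": "jonge"}],
        [{"LOWER": "de"}, {"LOWER": "jonge"}],
    ],
}

# Register every pattern once, up front.
for name, patterns in PATTERNS.items():
    matcher.add(name, patterns)


def canonical_names(doc):
    """Return the canonical name of each non-overlapping match in doc."""
    spans = matcher(doc, as_spans=True)  # Span.label_ is the match key
    return [span.label_ for span in filter_spans(spans)]


names = canonical_names(nlp("MARK RUTTE and Hugo de Jonge met; markie agreed."))
print(names)
```

Because filter_spans keeps the longest of any overlapping spans, "MARK RUTTE" is reported once rather than as a full match plus a separate "RUTTE" match.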
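The script proceeds in this order:

  • Open the csv file in a with open(...) block
  • Iterate over the posts one by one and do some preprocessing
  • Call spaCy's built-in nlp() function on each post to extract the named entities
  • Call my own match_patterns() function on each post to merge the entities I defined in the patterns mark_rutte, hugo_dejonge and adolf_hitler
  • Finally, loop over the entities in cleanests_post.ents, append all PERSON entities to people (a list), and use Counter() and pandas to produce a ranking of each person identified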
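The final counting step can also be normalised on its own, independently of the Matcher. The sketch below is an alternative to the Matcher-based merge, not the approach used in the script: it maps each mention to a canonical name through a hypothetical alias table (ALIASES and tally_people are names made up for this example) before counting.

```python
from collections import Counter

# Hypothetical alias table: lower-cased surface form -> canonical name.
ALIASES = {
    "mark": "Mark Rutte",
    "rutte": "Mark Rutte",
    "mark rutte": "Mark Rutte",
    "markie": "Mark Rutte",
    "hugo de jonge": "Hugo de Jonge",
}


def tally_people(mentions):
    """Count mentions per canonical name; unknown names are kept as-is."""
    canonical = (ALIASES.get(m.strip().lower(), m) for m in mentions)
    return Counter(canonical).most_common()


ranking = tally_people(
    ["Mark", "rutte", "Markie", "Hugo de Jonge", "Mark Rutte", "Angela Merkel"]
)
print(ranking)
```

The resulting list of (name, count) pairs has the same shape as people_tally.most_common() in the script, so it could be passed to pd.DataFrame with the same column names.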

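What goes wrong: it looks as if match_patterns() and add_person_ent() have no effect. My output is exactly the same as when I do not call match_patterns() at all, i.e. 'Mark', 'mark', 'Rutte', 'rutte', 'Mark Rutte', 'mark rutte' and 'markie' are still categorised as separate entities. Something seems to go wrong when overwriting cleanests_post.ents. In add_person_ent() I tried using spaCy's filter_spans() to solve the problem, but without success.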
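One likely contributor, judging from the code as posted: add_person_ent() assigns the filtered spans to its local name cleanests_post and returns it, but the Matcher does not use a callback's return value, so the Doc that was passed in is never modified. Below is a minimal sketch of a callback that writes to doc.ents instead. It assumes a blank English pipeline so that it runs without a trained model, and the sample sentence is invented for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: a blank pipeline, so doc.ents starts out empty.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)


def add_person_ent(matcher, doc, i, matches):
    """on_match callback: store the matched span in doc.ents in place."""
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="PERSON")
    # filter_spans keeps the longest of any overlapping spans; listing the
    # new span first lets it win a tie against an identical existing span.
    doc.ents = filter_spans([entity] + list(doc.ents))


# Register the patterns once, outside any per-post loop.
matcher.add(
    "Mark Rutte",
    [
        [{"LOWER": "mark"}, {"LOWER": "rutte"}],
        [{"LOWER": "mark"}],
        [{"LOWER": "rutte"}],
        [{"LOWER": "markie"}],
    ],
    on_match=add_person_ent,
)

doc = nlp("Markie spoke, then mark rutte left.")
matcher(doc)  # the callback updates doc.ents as a side effect
people = [(ent.text, ent.label_) for ent in doc.ents]
print(people)
```

This only makes each surface form a single PERSON span. The texts still differ ('Markie' versus 'mark rutte'), so counting ent.text would continue to produce separate rows; the canonical name is available from the match key via nlp.vocab.strings[match_id].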
