Python regular expression: don't mind spacing between letters

Posted by pwuypxnk on 2023-04-07 in Python


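I am writing a script that uses Python regular expressions to search for specific words in many PDF documents.
The problem is that some words are not read in correctly and therefore do not match. For example, my list of search words contains "Extremalproblem", but when the PDF is read in, the word appears as "Extre malproblem" (with a space between the e and the m), so no match is found.
Other words are misread in the same way (with extra spaces where they do not belong). Honestly, I don't know how to make sure a match is still found when the word I am looking for has spaces inside it. The match should be independent of the spacing within the word. Is there an option for that?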

# import packages
import PyPDF2
import re
import os, sys

dirs_list=[]
for root, dirs, files in os.walk(".", topdown=False):
    for name in dirs:
        dirs_list.append(dirs)

dirs_list=dirs_list[-1]
dirs_list.pop(0)

for k in dirs_list:
    data_names=os.listdir(k)
    data_names.pop(0)
    
    #counter=0
    for j in data_names:
        # open the pdf file
        reader = PyPDF2.PdfReader(os.path.join(k, j))

        # define key terms (each inner list is one group of synonyms)
        strings = [['Grosses Einmaleins', 'Einmaleins'],
                   ['Extremwertprobleme', 'Extremalprobleme', 'Extremwerte', 'Optimierung', 'Optimierungsprobleme']]

        
        total=len(strings)
        
        #print(total)

        counter=0
        # extract text and do the search
        for page in reader.pages:
            # extract the page text once instead of once per search term
            text = page.extract_text()
            print(text)
            # iterate over a copy, because matched groups are removed from strings
            for i in list(strings):
                for s in i:
                    # re.IGNORECASE: upper/lower case does not matter
                    res_search = re.search(s, text, re.IGNORECASE)

                    if res_search is not None:
                        counter += 1
                        # remove the matched group so it is not counted more than once
                        strings.remove(i)
                        break
                    
        print(j, counter/total)


Source: https://stackoverflow.com/questions/75840411/python-regular-expression-dont-mind-spacing-between-letters

1 Answer


wqnecbli1#

A quick fix is to remove all spaces.

Replace

res_search = re.search(s, text, re.IGNORECASE)

with

res_search = re.search(s.replace(" ",""), text.replace(" ",""), re.IGNORECASE)
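
As a self-contained illustration of this replacement, here is a minimal sketch. The helper name `search_ignoring_spaces` and the sample strings are made up for the example; in the script above, the text would come from `page.extract_text()`. Note that `replace(" ", "")` only deletes plain space characters, so a word split by a line break or tab would still not match.

```python
import re

def search_ignoring_spaces(term, text):
    """Hypothetical helper: search for term in text after deleting all spaces from both."""
    return re.search(term.replace(" ", ""), text.replace(" ", ""), re.IGNORECASE)

# Made-up sample standing in for the output of page.extract_text()
sample = "Ein Extre malproblem mit Nebenbedingung"

# The plain search misses the word because of the stray space
print(re.search("Extremalproblem", sample, re.IGNORECASE) is not None)  # False

# With spaces removed on both sides, the word is found
print(search_ignoring_spaces("Extremalproblem", sample) is not None)  # True

# Multi-word terms still work, because the term loses its spaces too
print(search_ignoring_spaces("Grosses Einmaleins", "Das gro sses Ein maleins") is not None)  # True
```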

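This should solve the problem you describe.
However, it will produce false positives. For example, if you are searching for "car pet" and the page contains "carpet", that will count as a match. If you can live with that limitation, it should do the job.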
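
If that limitation is a problem, one possible alternative (not part of the answer above, only a sketch) is to leave the text untouched and build a pattern that tolerates whitespace between the letters of each word while still requiring whitespace wherever the search term itself has a space. The function name `spaced_pattern` is hypothetical. The trade-off is that a term such as "car pet" is no longer found if the PDF extraction dropped the real space between the two words.

```python
import re

def spaced_pattern(term):
    """Hypothetical helper: compile a pattern that allows optional whitespace
    between the letters of each word in term. A space inside term must still
    show up as whitespace in the searched text."""
    words = term.split()
    # Within a word: any amount of whitespace (including none) between characters
    word_patterns = [r"\s*".join(re.escape(ch) for ch in word) for word in words]
    # Between words: at least one whitespace character
    return re.compile(r"\s+".join(word_patterns), re.IGNORECASE)

# The stray space inside the word is tolerated
print(spaced_pattern("Extremalproblem").search("Ein Extre malproblem") is not None)  # True

# "car pet" no longer matches "carpet", because the space in the term is required
print(spaced_pattern("car pet").search("a red carpet") is not None)  # False

# Extra spacing inside the individual words is still accepted
print(spaced_pattern("car pet").search("a c ar pet") is not None)  # True
```

Because `\s` also matches newlines and tabs, this variant additionally finds words that were split across a line break.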

Answered 2023-04-07
