regex - Iteration in a parsing script: value stays the same

Posted by arknldoa 12 months ago in Other

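I am currently developing a parser that iterates over .txt files containing earnings call transcripts. The goal is to extract the parts spoken by the CEO. The code snippet below is part of a larger script that extracts various pieces of information, such as the date of the call and the company. You can find the full transcript, including the regex, here: https://regex101.com/r/mhKevB/1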

import re

presentation_part = """
--------------------------------------------------------------------------------
Inge G. Thulin,  3M Company - Chairman, CEO & President    [3]
--------------------------------------------------------------------------------

          Thank you, Bruce, and good morning, everyone. Coming off a strong 2017, our team opened the new year with broad-based organic growth across all business groups. We expanded margins and posted a double-digit increase in earnings per share while continuing to invest in our business and return cash to our shareholders.
"""

ceos_lname_clean = ['Thulin', 'Davis']

try:
    ceos_speaches_pres = []
    if len(ceos_lname_clean) != 0: 
        for lname in ceos_lname_clean:
            ceo_pattern = fr'(?m){lname}.*?(?:CEO|Chief Executive Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)'  # Alternative pattern that matches on the CEO's name in addition to the term "CEO"
            ceo_textparts_pres = re.findall(ceo_pattern, presentation_part, re.DOTALL | re.IGNORECASE)
            ceo_speech_presentation = " ".join(ceo_textparts_pres)
            ceos_speaches_pres.append(ceo_speech_presentation)
        #Overall_dict[folder][comp_path]["CEO Presentation Speech"] = ceos_speaches_pres ##Add the text to a dict

    else: ##try for COO in case ceos_lname_clean is empty
        coos_speaches_pres = [] 
        for coo_lname in coos_lname_clean:
            coo_pattern = fr'(?m){coo_lname}.*?(?:COO|Chief Operating Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)'  # Alternative pattern that matches on the COO's name in addition to the term "COO"
            coo_textparts_pres = re.findall(coo_pattern, presentation_part, re.DOTALL | re.IGNORECASE)
            coo_speech_presentation = " ".join(coo_textparts_pres)
            coos_speaches_pres.append(coo_speech_presentation)
        #Overall_dict[folder][comp_path]["COO Presentation Speech"] = coos_speaches_pres ##Add the text to a dict
except:
    print("PROBLEM")

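The snippet above successfully extracts the text spoken by Thulin. However, once it is integrated into the full script, a problem appears: ceo_textparts_pres keeps the value from the previous iteration. That is, even though ceo_textparts_pres should be empty for Davis, it still contains the text spoken by Thulin.
I have spent a whole day trying to solve this without success, and I am getting more and more frustrated. Unfortunately, the full script is too long to post here, but even the smallest hint or suggestion about what might be causing this would be greatly appreciated.
Thanks in advance for your help.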

kmpatx3s (answer #1)

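Your regex pattern has .*? between the last name {lname} and the title part (CEO|Chief Executive Officer), and because of the re.DOTALL flag that .*? matches across multiple lines. As a result, both Davis and Thulin match the presentation part: a name only has to appear somewhere before a CEO header, not on the same line. I suggest not applying re.DOTALL to the whole pattern. Instead, turn it on only for specific parts with (?s:.*), or match newlines explicitly, like this: (?:\n|.)*
To demonstrate, I added a test case with two patterns below. The commented-out line uses (?-s:.*?) instead of .*?, which disables DOTALL for that part, so it does not match the bad case.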

import re

presentation_part = """
today Davis
met miss Thulin
they were both CEO
on day number [3]
- bad case"""

ceos_lname_clean = ["Thulin", "Davis"]

ceos_speaches_pres = []
for lname in ceos_lname_clean:
    ceo_pattern = rf"(?m){lname}.*?(?:CEO|Chief Executive Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)"
    # ceo_pattern = rf"(?m){lname}(?-s:.*?)(?:CEO|Chief Executive Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)"
    ceo_textparts_pres = re.findall(
        ceo_pattern, presentation_part, re.DOTALL | re.IGNORECASE
    )
    ceo_speech_presentation = " ".join(ceo_textparts_pres)
    ceos_speaches_pres.append(ceo_speech_presentation)

print(ceos_speaches_pres)
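
As a rough sketch of the same suggestion applied to the transcript layout from the question: the pattern already scopes DOTALL to the speech group with (?s:.*?), so it may be enough to stop passing re.DOTALL. The helper name extract_ceo_speeches, its flags parameter, and the "Jane Doe" speaker block are made up for illustration. re.escape is added on the assumption that last names should be matched literally. Scoped inline flags such as (?s:...) need Python 3.6 or later.

```python
import re

# Layout copied from the question; the "Jane Doe" block is hypothetical and only
# there so that both last names appear *before* the CEO's header.
transcript = """
----------------------------------------
Jane Doe,  Example Corp - Head of Investor Relations    [2]
----------------------------------------

          Thank you. With me today are Mr. Thulin and Mr. Davis.

----------------------------------------
Inge G. Thulin,  3M Company - Chairman, CEO & President    [3]
----------------------------------------

          Thank you, Bruce, and good morning, everyone.
"""


def extract_ceo_speeches(last_name, text, flags=re.IGNORECASE):
    """Return the speeches whose header line contains last_name and a CEO title.

    Same pattern as in the question. With the default flags, '.' stops at
    newlines everywhere except inside the (?s:...) speech group.
    """
    pattern = (
        rf"(?m){re.escape(last_name)}.*?(?:CEO|Chief Executive Officer)\b"
        rf"(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)"
    )
    return [part.strip() for part in re.findall(pattern, text, flags)]


# Without a global DOTALL, name and title must share one header line.
print(extract_ceo_speeches("Thulin", transcript))
# ['Thank you, Bruce, and good morning, everyone.']
print(extract_ceo_speeches("Davis", transcript))
# []

# Passing re.DOTALL again reproduces the problem: "Davis" in Jane Doe's speech
# reaches across lines to the CEO header and picks up Thulin's text.
print(extract_ceo_speeches("Davis", transcript, flags=re.DOTALL | re.IGNORECASE))
# ['Thank you, Bruce, and good morning, everyone.']
```

If this matches what happens in the full script, ceo_textparts_pres is not really carried over from the previous iteration. The second name simply matches the same header through the multi-line .*?.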

