regex: keep only the lines matching a second regex among the lines matched by a first regex

lvjbypge  posted 11 months ago  in Other

I have a large number of txt list files in the E:\Desktop\Linux_distro\asliiiii directory. Here is a sample of one of my files:

95
ROSA
139
96
Chakra
137
97
AV Linux
135
98
LibreELEC
134
99
Simplicity
131
100
Kodachi
130
20200301020449
79776361952441

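Now I need a script that first finds the lines matching the \d{14} regex and then, among the lines found, keeps only those that match the 20(?:0[0-9]|1[0-9]|20)[0-1][0-9] regex.
This means I should get the following result: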

95
ROSA
139
96
Chakra
137
97
AV Linux
135
98
LibreELEC
134
99
Simplicity
131
100
Kodachi
130
20200301020449


I wrote the following Python script, but I don't know why it doesn't work for my lists!

import os
import re

def process_file(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    # Find lines matching \d{14}
    regex_pattern_1 = re.compile(r'\d{14}')
    matching_lines = [line.strip() for line in lines if regex_pattern_1.search(line)]

    # Keep only matches of the second regex in the found lines
    regex_pattern_2 = re.compile(r'20(?:0[0-9]|1[0-9]|20)[0-1][0-9]\d{8}')
    filtered_lines = []
    for line in matching_lines:
        matches = regex_pattern_2.findall(line)
        filtered_lines.extend(matches)

    # Write the filtered lines back to the file
    with open(file_path, 'w') as file:
        file.write('\n'.join(filtered_lines))

def process_files_in_directory(directory_path):
    for filename in os.listdir(directory_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory_path, filename)
            process_file(file_path)

if __name__ == "__main__":
    directory_path = r'E:\Desktop\Linux_distro\asliiiii'
    process_files_in_directory(directory_path)
    print("Processing complete.")


But this script gives me the following result!!

20200301020449


What is wrong with this script?

wfypjpf4 #1

Try the following.

matches = regex_pattern_2.findall(line[:6])

Alternatively, adjust the pattern to include the remaining 8 characters.

20(?:0[0-9]|1[0-9]|20)[0-1][0-9]\d{8}
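
To see what the extra `\d{8}` changes, here is a minimal sketch that runs both patterns over the two 14-digit sample lines from the question. The names `short_pattern`, `full_pattern`, and `samples` are illustrative and not part of the original script.

```python
import re

# Pattern as written in the question text: it covers only the first 6 characters
# (the "20", the two-digit year suffix, and the two-digit month).
short_pattern = re.compile(r'20(?:0[0-9]|1[0-9]|20)[0-1][0-9]')

# Extended pattern: it also consumes the remaining 8 digits of the line.
full_pattern = re.compile(r'20(?:0[0-9]|1[0-9]|20)[0-1][0-9]\d{8}')

# The two 14-digit lines from the sample file.
samples = ['20200301020449', '79776361952441']

# findall returns the matched text, so the short pattern yields only a prefix.
short_matches = [short_pattern.findall(s) for s in samples]
full_matches = [full_pattern.findall(s) for s in samples]

print(short_matches)  # [['202003'], []]
print(full_matches)   # [['20200301020449'], []]
```

With the short pattern, `findall` returns just the 6-character prefix, so writing its result back would truncate the timestamp. With the extended pattern the whole line is returned.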

2skhul33 #2

In my opinion, too many people use regular expressions to solve problems that don't actually need them.

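def process_file(fn):
    # Write to a separate '.out' file so the original is left untouched.
    with open(fn) as fin, open(fn + '.out', 'w') as fout:
        # Copy every line up to and including the first 14-digit line.
        for line in fin:
            line = line.strip()
            print(line, file=fout)
            if len(line) == 14 and line.isdigit():
                break

        # From there on, keep only 14-digit lines that start with '20'.
        for line in fin:
            line = line.strip()
            if len(line) == 14 and line.isdigit() and line.startswith('20'):
                print(line, file=fout)

process_file('x.txt')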

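Now, I'm assuming that checking for "a 14-digit number starting with '20'" is enough to find your timestamps, but if you really need to find valid dates, you could use a regex here.
Note that I write to a new file with a special name. You can do a delete and rename at the end if you want.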
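
If a stricter check is wanted, one option is sketched below. It assumes the 14 digits are laid out as `YYYYMMDDHHMMSS`, which the question does not state explicitly, and it limits the year to 2000-2020 to mirror the question's regex. The helper names `is_timestamp` and `filter_lines` are hypothetical. Unlike the loops above, this version keeps every line that is not a 14-digit number, wherever it appears, and it finishes with `os.replace` as one way to do the delete and rename.

```python
import os
from datetime import datetime

def is_timestamp(line):
    """Return True for a 14-digit YYYYMMDDHHMMSS value with a year in 2000-2020."""
    if len(line) != 14 or not line.isdigit():
        return False
    try:
        parsed = datetime.strptime(line, '%Y%m%d%H%M%S')
    except ValueError:
        # Not a real calendar date/time (for example month 63).
        return False
    return 2000 <= parsed.year <= 2020

def filter_lines(lines):
    """Drop 14-digit lines that are not valid timestamps; keep everything else."""
    kept = []
    for line in lines:
        line = line.strip()
        if len(line) == 14 and line.isdigit() and not is_timestamp(line):
            continue
        kept.append(line)
    return kept

def process_file(fn):
    tmp = fn + '.out'
    with open(fn) as fin, open(tmp, 'w') as fout:
        for line in filter_lines(fin):
            print(line, file=fout)
    # Both files are closed here, so the temporary file can replace the original.
    os.replace(tmp, fn)
```

Calling `process_file('x.txt')` would then rewrite `x.txt` in place, so it is worth trying it on a copy first.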

mutmk8jj #3

The following script works well for me:

import os
import re

def process_file(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    # Keep lines that match the second regex or do not match any regex
    regex_pattern_2 = re.compile(r'20(?:0[0-9]|1[0-9]|20)[0-1][0-9]\d{8}')
    filtered_lines = [line.strip() for line in lines if regex_pattern_2.search(line) or not re.search(r'\d{14}', line)]

    # Write the filtered lines back to the file
    with open(file_path, 'w') as file:
        file.write('\n'.join(filtered_lines))

def process_files_in_directory(directory_path):
    for filename in os.listdir(directory_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory_path, filename)
            process_file(file_path)

if __name__ == "__main__":
    directory_path = r'E:\Desktop\Linux_distro\asliiiii'
    process_files_in_directory(directory_path)
    print("Processing complete.")
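
The keep-or-drop rule used in the script can be checked in isolation, without touching any files. The sketch below is a variation, not the script's exact behaviour: it uses `fullmatch` on the stripped line, so only lines that are exactly 14 digits are ever considered for removal. The names `keep_line`, `TIMESTAMP`, `FOURTEEN_DIGITS`, and `sample` are illustrative.

```python
import re

TIMESTAMP = re.compile(r'20(?:0[0-9]|1[0-9]|20)[0-1][0-9]\d{8}')
FOURTEEN_DIGITS = re.compile(r'\d{14}')

def keep_line(line):
    """Keep a line unless it is exactly 14 digits and not a timestamp."""
    line = line.strip()
    if FOURTEEN_DIGITS.fullmatch(line) is None:
        # Not a 14-digit line at all, so it is ordinary list content.
        return True
    return TIMESTAMP.fullmatch(line) is not None

# Tail of the sample file from the question.
sample = ['100', 'Kodachi', '130', '20200301020449', '79776361952441']
result = [line for line in sample if keep_line(line)]

print(result)  # ['100', 'Kodachi', '130', '20200301020449']
```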
