pandas 在文件名中查找相似时间

zdwk9cvp  于 2023-06-20  发布在  其他
关注(0)|答案(2)|浏览(104)

我有Excel文件,其中提供了特定的时间,例如17:40,18:15,10:11的特定日期,我有文件夹,其中multiplie文件存储广告,其中一些有“类似的时间”的名称,例如
XXXXXMCCAAS_17_43_22_Timecheck.csv
因此,通过查看文件夹,我必须找到多个文件,其中有一些类似于17:40的名称,所以它将是17_43_22
有没有什么方法可以自动打印与我的特定“模式”相匹配的文件?
我的第一个镜头将去文件使用正则表达式,并假设让说10-15分钟的窗口
但是有没有更好的方法呢?

djmepvbi

djmepvbi1#

你可以使用Python的内置库来完成这个任务。下面是一个使用os、re、pandas和datetime库的Python脚本。此脚本从Excel文件中读取时间,对其进行格式化,然后循环访问指定目录中的文件,在定义的时间窗口内查找匹配项。

import os
import re
import pandas as pd
from datetime import datetime, timedelta

# Read the Excel file
df = pd.read_excel('times.xlsx')  # adjust this to your file path and name

# Convert the times to the expected format and create a list
times = df['Time'].dt.strftime('%H_%M_%S').tolist()  # adjust 'Time' to your column name

# Define a function to check if a filename time is within a time window
def within_window(filename_time, excel_time, window=15):
    FMT = '%H_%M_%S'
    excel_time_obj = datetime.strptime(excel_time, FMT)
    
    # Add seconds if missing
    if excel_time_obj.second == 0:
        excel_time_obj = excel_time_obj + timedelta(seconds=0)
        
    tdelta = datetime.strptime(filename_time, FMT) - excel_time_obj
    if abs(tdelta.total_seconds()) <= window * 60:  # compare in seconds
        return True
    return False

# Loop through the files in the directory
for filename in os.listdir('/path/to/your/files'):  # adjust this to your directory path
    # Use regex to extract the time from the filename
    match = re.search(r'\d{2}_\d{2}_\d{2}', filename)
    if match:
        filename_time = match.group()
        # Check if the filename time is within the window for any excel times
        for time in times:
            if within_window(filename_time, time):
                print(filename)
                break  # if a match is found, no need to check the other times

确保根据您的设置调整Excel文件路径、目录路径和列名。

cs7cruho

cs7cruho2#

对于你问题的第一部分,我会这样做:

from pathlib import Path
import pandas as pd

times = (
    pd.read_excel("tmp/file.xlsx", usecols="A", dtype="str")
    .squeeze().str[:4].replace(":", "_", regex=True).tolist()
) #['17_4', '18_1', '10_1']

matches = []
for t in times:
    matches.extend(Path("tmp").glob(f"*_{t}*"))

输出:

for m in matches:
    print(m)

tmp\XXXXXMCCAAS_17_43_22_Timecheck.csv
tmp\XXXXXMCCAAS_10_10_34_Timecheck.csv
tmp\XXXXXMCCAAS_10_11_23_Timecheck.csv

使用的电子表格和树:

tmp/
┣━━ file.xlsx # <-- doesn't need to be in the same location
┣━━ XXXXXMCCAAS_00_03_45_Timecheck.csv
┣━━ XXXXXMCCAAS_00_25_45_Timecheck.csv
┣━━ XXXXXMCCAAS_00_45_43_Timecheck.csv
┣━━ XXXXXMCCAAS_01_16_35_Timecheck.csv
┣━━ XXXXXMCCAAS_03_01_51_Timecheck.csv
┣━━ XXXXXMCCAAS_03_04_12_Timecheck.csv
┣━━ XXXXXMCCAAS_04_47_22_Timecheck.csv
┣━━ XXXXXMCCAAS_09_27_07_Timecheck.csv
┣━━ XXXXXMCCAAS_10_10_34_Timecheck.csv
┣━━ XXXXXMCCAAS_10_11_23_Timecheck.csv
┣━━ XXXXXMCCAAS_13_28_15_Timecheck.csv
┣━━ XXXXXMCCAAS_14_09_30_Timecheck.csv
┣━━ XXXXXMCCAAS_14_59_41_Timecheck.csv
┣━━ XXXXXMCCAAS_16_33_31_Timecheck.csv
┣━━ XXXXXMCCAAS_17_24_02_Timecheck.csv
┣━━ XXXXXMCCAAS_17_43_22_Timecheck.csv
┣━━ XXXXXMCCAAS_18_24_16_Timecheck.csv
┣━━ XXXXXMCCAAS_19_19_42_Timecheck.csv
┣━━ XXXXXMCCAAS_20_42_18_Timecheck.csv
┗━━ XXXXXMCCAAS_21_21_25_Timecheck.csv

相关问题