pandas: How do I load only the most recent file from a directory where the filenames start with a date?

myzjeezk, posted on 2022-12-02 in Other

I have a directory/folder containing files named:

  1. 2022-07-31_DATA_GVAX_ARPA_COMBINED.csv
  2. 2022-08-31_DATA_GVAX_ARPA_COMBINED.csv
  3. 2022-09-30_DATA_GVAX_ARPA_COMBINED.csv
The folder will be updated with a new file each month, named in the same format as above, for example:
  • 2022-10-31_DATA_GVAX_ARPA_COMBINED.csv
  • 2022-11-30_DATA_GVAX_ARPA_COMBINED.csv

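I only want to load the most recent month's .csv file into a pandas DataFrame, not all of the files. How can I do that (perhaps with glob)?
I have seen the following used for a similar naming scheme (here the date is read from the last `_`-separated part of the filename):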

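from pathlib import Path

# Point at the folder itself; a trailing '/*' would make Path.glob match nothing
dir_files = r'/path/to/folder'

dico={}

for file in Path(dir_files).glob('DATA_GVAX_COMBINED_*.csv'):
    dico[file.stem.split('_')[-1]] = file

max_date = max(dico)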

Answer 1 (pgpifvop)

You can try the following:

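import pandas as pd
from pathlib import Path

# Path to the folder itself (no trailing '/*'; the glob pattern below does the matching)
dir_files = r'/path/to/folder'

dico = {}

for file in Path(dir_files).glob('*DATA_GVAX_ARPA_COMBINED*.csv'):
    # The date is the part of the filename before the first underscore
    date_value = pd.to_datetime(file.name.split('_')[0], errors="coerce")
    if pd.notna(date_value):
        dico[date_value] = file

max_date = max(dico.keys())  # raises ValueError if no file matched
filepath = dico[max_date]
print(f'{max_date} -> {filepath.name}')
# Prints, for example:
#
# 2022-10-31 00:00:00 -> 2022-10-31_DATA_GVAX_ARPA_COMBINED.csv

# Load only that file into a DataFrame
df = pd.read_csv(filepath)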
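
The same idea can be wrapped in a small helper that avoids building a dictionary. This is only a sketch under stated assumptions: the function name `latest_dated_file` and its `suffix` parameter are hypothetical, and it assumes every file of interest starts with a `YYYY-MM-DD` date followed by an underscore. It uses only the standard library to pick the path.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def latest_dated_file(
    folder: str, suffix: str = "_DATA_GVAX_ARPA_COMBINED.csv"
) -> Optional[Path]:
    """Return the matching file with the latest YYYY-MM-DD prefix, or None."""
    best_date, best_path = None, None
    for path in Path(folder).glob(f"*{suffix}"):
        # Assumption: the date is everything before the first underscore
        prefix = path.name.split("_", 1)[0]
        try:
            file_date = date.fromisoformat(prefix)
        except ValueError:
            continue  # skip names that do not start with a valid date
        if best_date is None or file_date > best_date:
            best_date, best_path = file_date, path
    return best_path
```

If the helper returns a path (rather than `None`), it can be passed straight to `pd.read_csv` to load just that month's data.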

Answer 2 (mrzz3bfm)

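Glob the directory using the known pattern of the files of interest, then sort the matches by filename. Because the names start with a YYYY-MM-DD date, the last one in sorted order is the most recent.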

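from glob import glob as GLOB
from os.path import join as JOIN, basename as BASENAME

def get_latest(directory):
    # The walrus operator (:=) requires Python 3.8+
    if all_files := GLOB(JOIN(directory, '*_DATA_GVAX_ARPA_COMBINED.csv')):
        # Sort by filename only, so the date prefix decides the order
        return sorted(all_files, key=BASENAME)[-1]
    return None  # no matching files

print(get_latest('/Users/Cobra'))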
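
Sorting by name works here only because of the date format. As a small illustration (the `day_first` list is a made-up counterexample, not taken from the question), zero-padded `YYYY-MM-DD` prefixes sort the same way as text and as dates, whereas a day-first format would not:

```python
from datetime import datetime

names = [
    "2022-09-30_DATA_GVAX_ARPA_COMBINED.csv",
    "2022-07-31_DATA_GVAX_ARPA_COMBINED.csv",
    "2022-11-30_DATA_GVAX_ARPA_COMBINED.csv",
    "2022-08-31_DATA_GVAX_ARPA_COMBINED.csv",
]


def parsed_date(name):
    # Parse the part before the first underscore as a real date
    return datetime.strptime(name.split("_")[0], "%Y-%m-%d")


# Plain string order and true chronological order agree for this format
same_order = sorted(names) == sorted(names, key=parsed_date)
latest = max(names)

# Hypothetical day-first names: string comparison picks the wrong one,
# because "31-..." sorts after "30-..." regardless of month
day_first = ["31-07-2022_DATA.csv", "30-11-2022_DATA.csv"]
wrong_latest = max(day_first)

print(same_order, latest, wrong_latest)
```

If the filenames ever switch to a format that is not year-first and zero-padded, parsing the date explicitly is the safer choice.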
