pandas 使用Python临时文件和Zipfiles进行资源管理

wtlkbnrh  于 2023-09-29  发布在  Python
关注(0)|答案(1)|浏览(70)

我需要创建一个子例程,循环遍历outlook消息集合,打开附件,并将zip文件夹中的任何表格数据提取到pandas数据框中。为了获取表格数据,我创建了一个名为zip_to_dfs的函数,它接受outlook MailItem附件作为参数。

#function to extract tabluar data within zip file to pandas dataframe. returns dictionary object(key=filename;value=pandas df)
import pandas as pd, zipfile, tempfile, os

def zip_to_dfs(attachment, extract_fn=None):
    #returns diciontary object with filename for key and dataframes from attached files as values
    df_objects = {}
    tmp=tempfile.TemporaryFile().name
    attachment.SaveAsFile(tmp)
    if zipfile.is_zipfile(tmp)==True:
        zf = zipfile.ZipFile(tmp)
        #below subroutine could be made to separate function (read tablular to df) to make more readable
        for file in zf.infolist():
            extension = os.path.splitext(file.filename)[1]
            if extension in ['.xls','.xlsx','.xlsm']:
                temp_df = pd.read_excel(zf.open(file.filename), header=None)
                df_objects.update({file.filename:temp_df})
            elif file.filename.endswith(".csv"):
                temp_df = pd.read_csv(zf.open(file.filename), header=None)
                df_objects.update({file.filename:temp_df})
            else:
                raise NotImplementedError('Unexpected filetype: '+str(file.filename))
    else: 
        raise NotImplementedError('Expected zip file')
    return(df_objects)

该函数按预期工作,但可能效率不高。有人使用过tempfile或zip文件库吗?如果是这样,你知道Zipfile和TemporaryFile方法是否会自动清理吗?或者这些文件在磁盘上是打开的?您是否发现这种方法存在其他明显的问题?
编辑的代码版本:

def zipattach_to_dfs(attachment, extract_fn=None):
    #evaluates zip file attachments and returns dictionary with file name as key and dataframes as values
    df_objects = {}
    with NamedTemporaryFile(suffix='.tmp', delete=False) as tmp:
        attachment.SaveAsFile(tmp.name)
        zf = ZipFile(tmp)
        for file in zf.infolist():
            datetime = (file.date_time)
            key = (f'{file.filename}({datetime[0]}-{datetime[1]}-{datetime[2]})')
            if isexcel(file) ==True:
                temp_df = pd.read_excel(zf.open(file.filename), header=None)
                df_objects.update({key:temp_df})
            elif file.filename.endswith(".csv"):
                temp_df = pd.read_csv(zf.open(file.filename), header=None)
                df_objects.update({key:temp_df})
            else:
                raise NotImplementedError('Unexpected filetype: '+str(file.filename))
    return (df_objects)
yx2lnoni

yx2lnoni1#

ZipFile也支持 with 语句。下面是我根据你的代码提出的建议:

def zip_to_dfs(attachment, extract_fn=None): # extract_fn ?
    '''
    Returns a dictionary object with filename for key
    and dataframes from attached files as values.
    '''
    with NamedTemporaryFile(delete=False) as tmp:
        attachment.SaveAsFile(tmp.name)
        
        if is_zipfile(tmp):
            with ZipFile(tmp) as zf:
                for file in zf.infolist():
                    fn, exte = file.filename.rsplit(".", 1)
                    key = (f'{fn} ({"-".join(map(str, file.date_time[:3]))})')
                    
                    if exte in {'xls', 'xlsx', 'xlsm', 'csv'}:
                        df_objects = {}
                        with zf.open(file) as zip_file:
                            if exte == 'csv':
                                df = pd.read_csv(zip_file, header=None)
                            else:
                                df = pd.read_excel(zip_file, header=None)
                            df_objects[key] = df
                    else:
                        raise NotImplementedError(
                            'Unexpected filetype: ' + file.filename
                        )
                return df_objects

调用这个函数看起来像下面这样:

from win32com.client import Dispatch

outlook = Dispatch("Outlook.Application").GetNamespace("MAPI")
folder = outlook.Folders("[email protected]").Folders("Inbox")

out = pd.concat(
    [v.assign(date=k) for item in folder.Items for att in item.Attachments
     if zip_to_dfs(att) for k,v in zip_to_dfs(att).items()
    ]
)  # this will consolidate all the dfs in a single one

相关问题