pandas 循环通过.zip文件并在zip内打印excel长度

rvpgvaaj 于 2022-12-21 发布在其他

关注(0)|答案(1)|浏览(151)

我有麻烦循环通过我的. zip文件（其中5个）在我的目录，其中有多个工作表在他们。
1.我想计算每个. zip文件中的记录数（附加每张表的记录数）
现在我刚刚得到了下面输出的第一行。我确实认为我在df = df[:1]这一行有一个问题，因为我认为它只限于一行。有什么想法吗？
输出应该类似于-

代码：

# Import File Function
def read_excel_sheets(xls_path):
   """Read all sheets of an Excel workbook and return a single DataFrame"""
   xl = pd.ExcelFile(xls_path)
   df = pd.DataFrame()
   for idx, name in enumerate(xl.sheet_names):
        sheet = xl.parse(name, header=None, dtype=str, ignore_index=True)
        # Drop Empty Columns
        sheet.dropna(axis=1, how='all', inplace=True)
        # Add sheet name as column
        sheet['sheet'] = name.split(" ")[-1]
        # Appending Data
        if len(df.columns) >= len(sheet.columns):
            df = df.append(sheet, ignore_index=True, sort=True)
        else:
            df = sheet.append(df, ignore_index=True, sort=True)
        del sheet

   return df

# Process File Function
def process_files(list_of_files):
    df = pd.DataFrame()
    for file in list_of_files:
        # zip file handler
        zip = zipfile.ZipFile(file)
        # list available files in the container
        zfiles = zip.namelist()
        #extensions to process
        extensions = (".xls",".xlsx")
        # Importing to Dataframe
        for zfile in zfiles:
            if zfile.endswith(extensions):
                # For Row Count Check
                df_temp = read_excel_sheets(zip.open(zfile))
                df = df.append(df_temp, ignore_index=True, sort=True)
                df['Length'] = len(df)
                df['FileName'] = file
                df = df[['Length','FileName']]
                df = df[:1]
                
    return df         
                


input_location = r'O:\Stack\Over\Flow'
month_to_process = glob.glob(input_location + "\\2022 - 10\\*.zip")             
df = process_files(month_to_process)  
print(df)

pandas

来源：https://stackoverflow.com/questions/74837586/loop-through-zip-files-and-print-length-of-excels-inside-zip

1条答案

按热度按时间

nwsw7zdq1#

全局化路径中所有文件名包含“Test”的zip文件。循环文件和工作表创建一个字典。将字典转换为Pandasdf。
您必须更改文件路径和文件名：

import os
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile

import pandas as pd

def process_files(files: list) -> pd.DataFrame:
    file_mapping = {}
    for file in files:
        with ZipFile(file, "r") as zipped:
            os.chdir(path)
            zipped.extractall()

        data_mapping = pd.read_excel(BytesIO(ZipFile(file).read(Path(file).stem)), sheet_name=None)

        row_counts = []
        for sheet in list(data_mapping.keys()):
            row_counts.append(len(data_mapping.get(sheet)))

        file_mapping.update({file: sum(row_counts)})

    frame = pd.DataFrame([file_mapping]).transpose().reset_index()
    frame.columns = ["file_name", "row_counts"]

    return frame

path = f"{Path().home()}/Documents/"
zip_files = (str(x) for x in Path(path).glob("Test*.zip"))
df = process_files(zip_files)
print(df)

输出：

file_name  length
0  /Users/joeblow/Documents/Test2.xlsx.zip      16
1  /Users/joeblow/Documents/Test3.xlsx.zip      16
2  /Users/joeblow/Documents/Test4.xlsx.zip       4
3  /Users/joeblow/Documents/Test1.xlsx.zip      21

赞(0）回复(0）举报 2022-12-21

我来回答

pandas 循环通过.zip文件并在zip内打印excel长度

1条答案

相关问题

热门标签

最新问答