pandas: breaking a dataframe into chunks for a loop

aelbi1ox  posted 12 months ago  in Other
Follow (0) | Answers (3) | Views (127)

I am using a loop (that was answered in this question) to iterate over and open multiple CSV files, transpose them, and concatenate them into one large dataframe. Each CSV file is 15 MB and has more than 10,000 rows, and there are more than 1,000 files. I found that the first 50 iterations finish within a few seconds, but after that each iteration takes about a minute. I don't mind leaving my computer on overnight, but I may need to do this several times and I worry it will get exponentially slower. Is there a more memory-efficient way to do this, for example splitting the df into chunks of 50 rows and concatenating them at the end?
In the code below, df is a dataframe with 1,000 rows whose columns hold the folder and file names.

merged_data = pd.DataFrame()
count = 0
 for index, row in df.iterrows():
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = os.path.join(root_path, folder_name, file_name)
    file_data = pd.read_csv(file_path, names=['Case', f'{folder_name}_{file_name}'], sep='\t')
    file_data_transposed = file_data.set_index('Case').T.reset_index(drop=True)
    file_data_transposed.insert(loc=0, column='folder_file_id', value=str(folder_name+'_'+file_name))
    merged_data = pd.concat([merged_data, file_data_transposed], axis=0, ignore_index=True)
    count = count + 1
    print(count)
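
For reference, a minimal sketch of the chunking idea described above (merge every 50 files into an intermediate frame, then combine the intermediates at the end); chunk_size and the intermediate lists are illustrative names, and df, root_path, os and pd are assumed to be set up as in the code above:

chunk_size = 50        # illustrative value
chunks = []            # finished intermediate frames
pending = []           # transposed frames not yet merged

for index, row in df.iterrows():
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = os.path.join(root_path, folder_name, file_name)
    file_data = pd.read_csv(file_path, names=['Case', f'{folder_name}_{file_name}'], sep='\t')
    file_data_transposed = file_data.set_index('Case').T.reset_index(drop=True)
    file_data_transposed.insert(loc=0, column='folder_file_id', value=f'{folder_name}_{file_name}')
    pending.append(file_data_transposed)
    if len(pending) == chunk_size:
        # merge the current chunk and start a new one
        chunks.append(pd.concat(pending, axis=0, ignore_index=True))
        pending = []

if pending:  # leftover files that did not fill a full chunk
    chunks.append(pd.concat(pending, axis=0, ignore_index=True))
merged_data = pd.concat(chunks, axis=0, ignore_index=True)

This avoids growing one ever-larger frame on every iteration; the answers below go further and defer all concatenation to a single call at the end.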


0qx6xfy6  1#

If you are working with a large dataset and want to explore parallelization, you could consider using Python's concurrent.futures module for multiprocessing. That way, each process can handle reading and processing a subset of the CSV files concurrently.

import os
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_file(row):
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = os.path.join(root_path, folder_name, file_name)

    file_data = pd.read_csv(file_path, names=['Case', f'{folder_name}_{file_name}'], sep='\t')
    file_data_transposed = file_data.set_index('Case').T.reset_index(drop=True)
    file_data_transposed.insert(loc=0, column='folder_file_id', value=str(folder_name+'_'+file_name))

    return file_data_transposed

root_path = r'G:\path'

# Adjust the number of processes based on your system's capabilities
num_processes = 4  # You can experiment with different values

with ProcessPoolExecutor(max_workers=num_processes) as executor:
    # Pass dict records so process_file can index rows by column name;
    # itertuples(index=False) would break row['File ID'] because namedtuples
    # are not subscriptable by string and fields with spaces get renamed.
    results = list(executor.map(process_file, df.to_dict('records')))

merged_data = pd.concat(results, axis=0, ignore_index=True)
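
One caveat worth adding to the sketch above: on platforms where multiprocessing uses the spawn start method (Windows, recent macOS), the pool must be created under an if __name__ == '__main__': guard, because every worker process re-imports the script. Roughly, with num_processes, process_file and df defined as before:

if __name__ == '__main__':
    # Without the guard, each spawned worker re-imports this module and
    # would try to start its own pool, raising a RuntimeError.
    with ProcessPoolExecutor(max_workers=num_processes) as executor:
        results = list(executor.map(process_file, df.to_dict('records')))
    merged_data = pd.concat(results, axis=0, ignore_index=True)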


cl25kdpy  2#

The reason your code is slow is that you call concat inside the loop. You should collect the data in a Python dict and do a single concat at the end (a small timing sketch at the end of this answer illustrates the difference).
A few other improvements:

import pathlib
import pandas as pd

root_path = pathlib.Path('root')  # use pathlib instead of os.path

data = {}
# use enumerate rather than create an external counter
for count, (_, row) in enumerate(df.iterrows(), 1):
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = root_path / folder_name / file_name
    folder_file_id = f'{folder_name}_{file_name}'

    file_data = pd.read_csv(file_path, header=None, sep='\t',
                            names=['Case', folder_file_id],
                            memory_map=True, low_memory=False)
    data[folder_file_id] = file_data.set_index('Case').squeeze()
    print(count)

merged_data = (pd.concat(data, names=['folder_file_id'])
                 .unstack('Case').reset_index())

Output:

>>> merged_data
Case       folder_file_id       0       1       2       3       4
0     folderA_file001.txt  1234.0  5678.0  9012.0  3456.0  7890.0
1     folderB_file002.txt  4567.0  8901.0  2345.0  6789.0     NaN


Input data:

>>> df
   File ID    File Name
0  folderA  file001.txt
1  folderB  file002.txt

>>> cat root/folderA/file001.txt
0   1234
1   5678
2   9012
3   3456
4   7890

>>> cat root/folderB/file002.txt
0   4567
1   8901
2   2345
3   6789


Multithreaded version:

from concurrent.futures import ThreadPoolExecutor
import pathlib
import pandas as pd

root_path = pathlib.Path('root')

def read_csv(args):
    count, row = args  # expand arguments
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = root_path / folder_name / file_name
    folder_file_id = f'{folder_name}_{file_name}'

    file_data = pd.read_csv(file_path, header=None, sep='\t',
                            names=['Case', folder_file_id],
                            memory_map=True, low_memory=False)
    print(count)
    return folder_file_id, file_data.set_index('Case').squeeze()

with ThreadPoolExecutor(max_workers=2) as executor:
    batch = enumerate(df[['File ID', 'File Name']].to_dict('records'), 1)
    data = executor.map(read_csv, batch)

merged_data = (pd.concat(dict(data), names=['folder_file_id'])
                 .unstack('Case').reset_index())
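
As a sanity check of the claim at the top of this answer, here is a small, self-contained timing sketch (sizes are illustrative) comparing concat inside the loop with a single concat at the end:

import time
import numpy as np
import pandas as pd

parts = [pd.DataFrame(np.random.rand(1, 100)) for _ in range(2000)]

start = time.perf_counter()
slow = parts[0].copy()
for part in parts[1:]:
    slow = pd.concat([slow, part], ignore_index=True)  # re-copies all previous rows every iteration
print('concat in loop:', round(time.perf_counter() - start, 2), 's')

start = time.perf_counter()
fast = pd.concat(parts, ignore_index=True)  # one allocation and copy at the end
print('single concat :', round(time.perf_counter() - start, 2), 's')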

ljsrvy3e  3#

Check this code. It collects each transposed DataFrame in a list and concatenates them all at once at the end:

import pandas as pd
import os

merged_data_list = []

for index, row in df.iterrows():
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = os.path.join(root_path, folder_name, file_name)
    
    file_data = pd.read_csv(file_path, names=['Case', f'{folder_name}_{file_name}'], sep='\t')
    file_data_transposed = file_data.set_index('Case').T.reset_index(drop=True)
    file_data_transposed.insert(loc=0, column='folder_file_id', value=str(folder_name+'_'+file_name))
    
    merged_data_list.append(file_data_transposed)

# Concatenate all DataFrames in the list at once
merged_data = pd.concat(merged_data_list, axis=0, ignore_index=True)

