将包含多个表格的CSV文件拆分为不同的panda dataFrame(Python)

oxalkeyp  于 2022-12-06  发布在  Python
关注(0)|答案(1)|浏览(122)

我有多个CSV文件,其中有多个表格的格式,由换行符分隔。
示例:

Technology  C_inv [MCHF/y]  C_maint [MCHF/y]
NUCLEAR 70.308020   33.374568
HYDRO_DAM_EXISTING  0.000000    195.051200
HYDRO_DAM   67.717942   1.271600
HYDRO_RIVER_EXISTING    0.000000    204.820000
IND_BOILER_OIL  2.053610    0.532362
IND_BOILER_COAL 4.179935    1.081855
IND_BOILER_WASTE    11.010126   2.849652
DEC_HP_ELEC 554.174644  320.791276
DEC_THERMAL_HP_GAS  77.077291   33.717477
DEC_BOILER_GAS  105.586089  41.161335
DEC_BOILER_OIL  33.514266   25.948450
H2_FROM_GAS 145.185290  59.178082
PYROLYSIS   132.200818  112.392123

Storage technology  C_inv [MCHF/y]  C_maint [MCHF/y]
HYDRO_STORAGE   0.000000    0.000000

Resource    C_op [MCHF/y]
ELECTRICITY 1174.452848
GASOLINE    702.000000
DIESEL  96.390000
OIL 267.787558
NG  1648.527242
WOOD    592.110000
COAL    84.504083
URANIUM 18.277626
WASTE   0.000000

我的所有CSV文件都有不同的子表名称,但数量很少,我可以手动输入它们,以便在需要时检测它们。
另一个问题是,许多标题都包含空格(如“存储技术”),Pandas将其视为两栏。
我最初尝试直接用Pandas和手动拆分来做,但是允许避免错误的参数on_bad_lines='skip'也跳过了有用的行:

Cost_bd = pd.read_csv(f"{Directory}/cost_breakdown.csv",on_bad_lines='skip',delim_whitespace=True).dropna(axis=1,how='all')
colnames=['Technnolgy', 'C_inv[MCHF/y]', 'C_maint[MCHF/y]'] 
Cost_bd.columns = colnames

我相信扫描.txt文件并将其拆分可能会更好,但我不确定如何以最佳方式进行此操作。
我还尝试使用此feed中提供的解决方案

import csv
from os.path import dirname # gets parent folder in a path
from os.path import join # concatenate paths
table_names = ["Technology", "Storage technology", "Resource"]

df = pd.read_csv(f"{Directory}/cost_breakdown.csv", header=None, names=range(3))
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}

但它不起作用:

tables.keys()=
dict_keys(['Technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]'])

编辑:基于@Rabinzel的最终解决方案:

import re
def make_df(group,dict_of_dfs):
    header, data = re.split(r'\t',group[0]), list(map(str.split, group[1:]))
    if len(header) != len(data[0]): # If missing columns list, take former
        header  = header + dict_of_dfs[list(dict_of_dfs.keys())[0]].columns.tolist()[1:]
    dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)
    return dict_of_dfs

def Read_csv_as_df(path, file_name):
    
    with open(path+file_name) as f:
            dict_of_dfs = {}
            group = []
            for line in f:
                if line!='\n':
                    group.append(line.strip())
                else:
                    print(dict_of_dfs)
                    dict_of_dfs = make_df(group,dict_of_dfs)
                    group = []
            dict_of_dfs = make_df(group,dict_of_dfs)
    return dict_of_dfs
t2a7ltrp

t2a7ltrp1#

我会用下面的方法来做。遍历每一行,把每一个由换行符分隔的块附加到一个列表中,并从列表中构建 Dataframe 。关于列名中有空格的问题,使用re.split,并且只有当有两个或更多空格时才进行拆分。把不同的df保存在一个字典中,字典中的key是每个df标题的第一个元素。

import re
def make_df(group):
    header, data = re.split(r'\s\s+',group[0]), list(map(str.split, group[1:]))
    dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)

with open('your_csv_file.csv') as f:
    dict_of_dfs = {}
    group = []
    for line in f:
        if line!='\n':
            group.append(line.strip())
        else:
            make_df(group)
            group = []
    make_df(group)

for key, value in dict_of_dfs.items():
    print(f"{key=}\ndf:\n{value}\n---------------------")

输出量:

key='Technology'
df:
              Technology C_inv [MCHF/y] C_maint [MCHF/y]
0                NUCLEAR      70.308020        33.374568
1     HYDRO_DAM_EXISTING       0.000000       195.051200
2              HYDRO_DAM      67.717942         1.271600
3   HYDRO_RIVER_EXISTING       0.000000       204.820000
4         IND_BOILER_OIL       2.053610         0.532362
5        IND_BOILER_COAL       4.179935         1.081855
6       IND_BOILER_WASTE      11.010126         2.849652
7            DEC_HP_ELEC     554.174644       320.791276
8     DEC_THERMAL_HP_GAS      77.077291        33.717477
9         DEC_BOILER_GAS     105.586089        41.161335
10        DEC_BOILER_OIL      33.514266        25.948450
11           H2_FROM_GAS     145.185290        59.178082
12             PYROLYSIS     132.200818       112.392123
---------------------
key='Storage technology'
df:
  Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
0      HYDRO_STORAGE       0.000000         0.000000
---------------------
key='Resource'
df:
      Resource C_op [MCHF/y]
0  ELECTRICITY   1174.452848
1     GASOLINE    702.000000
2       DIESEL     96.390000
3          OIL    267.787558
4           NG   1648.527242
5         WOOD    592.110000
6         COAL     84.504083
7      URANIUM     18.277626
8        WASTE      0.000000
---------------------

相关问题