我有多个CSV文件,其中有多个表格的格式,由换行符分隔。
示例:
Technology C_inv [MCHF/y] C_maint [MCHF/y]
NUCLEAR 70.308020 33.374568
HYDRO_DAM_EXISTING 0.000000 195.051200
HYDRO_DAM 67.717942 1.271600
HYDRO_RIVER_EXISTING 0.000000 204.820000
IND_BOILER_OIL 2.053610 0.532362
IND_BOILER_COAL 4.179935 1.081855
IND_BOILER_WASTE 11.010126 2.849652
DEC_HP_ELEC 554.174644 320.791276
DEC_THERMAL_HP_GAS 77.077291 33.717477
DEC_BOILER_GAS 105.586089 41.161335
DEC_BOILER_OIL 33.514266 25.948450
H2_FROM_GAS 145.185290 59.178082
PYROLYSIS 132.200818 112.392123
Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
HYDRO_STORAGE 0.000000 0.000000
Resource C_op [MCHF/y]
ELECTRICITY 1174.452848
GASOLINE 702.000000
DIESEL 96.390000
OIL 267.787558
NG 1648.527242
WOOD 592.110000
COAL 84.504083
URANIUM 18.277626
WASTE 0.000000
我的所有CSV文件都有不同的子表名称,但数量很少,我可以手动输入它们,以便在需要时检测它们。
另一个问题是,许多标题都包含空格(如“存储技术”),Pandas将其视为两栏。
我最初尝试直接用Pandas和手动拆分来做,但是允许避免错误的参数on_bad_lines='skip'
也跳过了有用的行:
Cost_bd = pd.read_csv(f"{Directory}/cost_breakdown.csv",on_bad_lines='skip',delim_whitespace=True).dropna(axis=1,how='all')
colnames=['Technnolgy', 'C_inv[MCHF/y]', 'C_maint[MCHF/y]']
Cost_bd.columns = colnames
我相信扫描.txt
文件并将其拆分可能会更好,但我不确定如何以最佳方式进行此操作。
我还尝试使用此feed中提供的解决方案
import csv
from os.path import dirname # gets parent folder in a path
from os.path import join # concatenate paths
table_names = ["Technology", "Storage technology", "Resource"]
df = pd.read_csv(f"{Directory}/cost_breakdown.csv", header=None, names=range(3))
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}
但它不起作用:
tables.keys()=
dict_keys(['Technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]'])
编辑:基于@Rabinzel的最终解决方案:
import re
def make_df(group,dict_of_dfs):
header, data = re.split(r'\t',group[0]), list(map(str.split, group[1:]))
if len(header) != len(data[0]): # If missing columns list, take former
header = header + dict_of_dfs[list(dict_of_dfs.keys())[0]].columns.tolist()[1:]
dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)
return dict_of_dfs
def Read_csv_as_df(path, file_name):
with open(path+file_name) as f:
dict_of_dfs = {}
group = []
for line in f:
if line!='\n':
group.append(line.strip())
else:
print(dict_of_dfs)
dict_of_dfs = make_df(group,dict_of_dfs)
group = []
dict_of_dfs = make_df(group,dict_of_dfs)
return dict_of_dfs
1条答案
按热度按时间t2a7ltrp1#
我会用下面的方法来做。遍历每一行,把每一个由换行符分隔的块附加到一个列表中,并从列表中构建 Dataframe 。关于列名中有空格的问题,使用
re.split
,并且只有当有两个或更多空格时才进行拆分。把不同的df保存在一个字典中,字典中的key是每个df标题的第一个元素。输出量: