How can pandas read CSV-formatted files that have a different extension?

Posted by dkqlctbz on 2022-11-27 in Other

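I have a dataset that has a clean DataFrame structure from row 3 onward. The first rows, unfortunately, use a mix of delimiters and contain some information that I want to include in my DataFrame. The files are mostly CSV-structured, but they have extensions such as WOC, WOL, and WPL.
The first rows of a WOC file look like this: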

Person:?,?;F dob. ?  MT: ? Z:C NewYork Mon.:S St.?

144 cm/35 Kg/5 YearsOld




45,34,22,26,0
78,74,82,11,0

The header for the values that follow should look like this:

A, B, C, D, E
45,34,22,26,0
78,74,82,11,0

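Here is my attempt:

import glob
import chardet
import pandas as pd

df44 = pd.DataFrame()  # creates empty dataframe

for f in glob.glob('file_path_to_single_file'):

    # detect the encoding from the raw bytes of the file
    with open(f, 'rb') as file:
        encodings = chardet.detect(file.read())["encoding"]
    a = pd.read_csv(f, sep=r'\s+|;|,', engine='python', encoding=encodings, header=None, names=['A', 'B', 'C', 'D', 'E'], skiprows=2)
    df44 = pd.concat([df44, a])  # DataFrame.append was removed in pandas 2.0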

What is the best way to read such a file so that I can also extract the height, weight, age, and city?
My expected output is:

A, B, C, D, E, City, Height, Weight, Age
45,34,22,26,0,NewYork, 144,    35,   5
78,74,82,11,0,NewYork, 144,    35,   5
Answer 1 (by oipij1gg)

Based on the additional information you provided in the comments above, I think you can start building your solution from the following:

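# I created a file 'data.woc' containing the data from your question:
import re
import pandas as pd

stack_data = '''Person:?,?;F dob. ?  MT: ? Z:C NewYork Mon.:S St.?

144 cm/35 Kg/5 YearsOld




45,34,22,26,0
78,74,82,11,0'''

# write the sample to disk so the example can be reproduced
with open('data.woc', 'w') as f:
    f.write(stack_data)

# read the heading rows; I arbitrarily chose 5 rows to read
with open('data.woc', 'r') as f:
    heading_rows = [next(f) for _ in range(5)]

# the city is the first word surrounded by spaces on the first line
city = re.findall(pattern=r' \w+ ', string=heading_rows[0])[0].strip()

# locate the row that mentions both 'cm' and 'kg', then pull out its numbers
# (note: `'cm' and 'kg' in row` would only test for 'kg')
measure_idx = next(i for i, row in enumerate(heading_rows)
                   if 'cm' in row.lower() and 'kg' in row.lower())
numbers_list = re.findall(pattern=r'\d+', string=heading_rows[measure_idx])

height, weight, age = [int(numbers_list[i]) for i in range(3)]

# skip everything up to and including the measurement row; blank lines are
# skipped by default. `comment` only accepts a single character, so
# comment='cm' cannot be used to drop that row.
df = pd.read_csv('data.woc', sep=r'\s+|;|,', engine='python', skiprows=measure_idx + 1, index_col=None, names=list('ABCDE'))

df.dropna(inplace=True)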
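
To get closer to the expected output in the question, the extracted values can be attached to every row as constant columns, and the same steps can be repeated for each file regardless of its extension, since `pd.read_csv` does not look at the extension. The following is only a sketch under assumptions: `read_woc` is a hypothetical helper name, the sample files are written to a temporary directory for illustration, and every file is assumed to follow the same layout as the sample in the question (city on the first line, a `cm`/`Kg` line within the first five rows).

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

# Sample contents copied from the question.
SAMPLE = """Person:?,?;F dob. ?  MT: ? Z:C NewYork Mon.:S St.?

144 cm/35 Kg/5 YearsOld




45,34,22,26,0
78,74,82,11,0
"""


def read_woc(path, n_heading=5):
    """Hypothetical helper: parse one file laid out like SAMPLE."""
    with open(path, "r") as f:
        heading = f.read().splitlines()[:n_heading]

    # City: first word surrounded by spaces on the first line.
    city = re.findall(r" \w+ ", heading[0])[0].strip()

    # Measurement row: the one mentioning both 'cm' and 'kg'.
    idx = next(i for i, row in enumerate(heading)
               if "cm" in row.lower() and "kg" in row.lower())
    height, weight, age = (int(n) for n in re.findall(r"\d+", heading[idx])[:3])

    # Skip the heading through the measurement row; blank lines are
    # skipped by read_csv by default.
    df = pd.read_csv(path, sep=r"\s+|;|,", engine="python",
                     skiprows=idx + 1, header=None, names=list("ABCDE"))

    # Broadcast the scalar values to every row.
    return df.assign(City=city, Height=height, Weight=weight, Age=age)


with tempfile.TemporaryDirectory() as tmp:
    # Two illustrative files with different extensions.
    for name in ("a.woc", "b.wol"):
        Path(tmp, name).write_text(SAMPLE)

    files = sorted(p for p in Path(tmp).iterdir()
                   if p.suffix.lower() in {".woc", ".wol", ".wpl"})
    result = pd.concat([read_woc(p) for p in files], ignore_index=True)

print(result)
```

If the files use varying encodings, the `chardet` detection from the question could be passed through to both `open` and `pd.read_csv` via their `encoding` arguments.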
