Reading a poorly written CSV file with pandas

pxy2qtax  posted on 2023-06-27  in Other

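I was given a poorly written CSV file that I want to load with pandas' read_csv. Below are the first few lines, to show what it looks like, followed by the error it produces. File test.csv: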

feature_idx,cv_scores,avg_score,total-features
(4,),[0.71657    0.75430665 0.77866281 0.85293036 0.76370522],0.773235007449579,80
(4, 15),[0.79150981 0.82751849 0.83777517 0.9246948  0.82462535],0.8412247254527763,80
(1, 4, 15),[0.82173419 0.85052599 0.86065046 0.93704226 0.84315839],0.862622256166522,80
(1, 4, 15, 70),[0.82448556 0.86513518 0.87640778 0.93881338 0.84777784],0.8705239466728865,80

When I try to load it:

pandas.read_csv('test.csv')

pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 6

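I understand this happens because the first field is a tuple, so its internal commas are read as field separators. How can I tell pandas that the first field is a tuple, so that everything between ( and ) is treated as a single field?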
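The cause can be seen without pandas. The following is a minimal sketch, assuming the rows shown above; `SAMPLE` and `count_fields` are names introduced here for illustration. It counts the fields that a plain comma split produces on each line.

```python
import csv
import io

# The sample rows from test.csv, held in memory so the sketch is self-contained.
SAMPLE = """\
feature_idx,cv_scores,avg_score,total-features
(4,),[0.71657    0.75430665 0.77866281 0.85293036 0.76370522],0.773235007449579,80
(4, 15),[0.79150981 0.82751849 0.83777517 0.9246948  0.82462535],0.8412247254527763,80
(1, 4, 15),[0.82173419 0.85052599 0.86065046 0.93704226 0.84315839],0.862622256166522,80
(1, 4, 15, 70),[0.82448556 0.86513518 0.87640778 0.93881338 0.84777784],0.8705239466728865,80
"""


def count_fields(text):
    """Return the number of comma-separated fields found on each line."""
    return [len(row) for row in csv.reader(io.StringIO(text))]


field_counts = count_fields(SAMPLE)
print(field_counts)
```

The header has 4 fields and the first two data rows have 5, while the row on line 4 has 6. That appears consistent with the message "Expected 5 fields in line 4, saw 6".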

Edit

The current answer still does not work.

df = pd.read_csv('test.csv', converters={'feature_idx': parse_tuple}) # parse_tuple as per the answer

pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 6

# pandas version
>>> print(pd.__version__)
1.5.3

xwbd5t1u1#

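The parse_tuple() function uses ast.literal_eval() to safely evaluate the tuple values and return them as actual tuples. Note that converters are applied only after each line has been split into fields, so the file must first tokenize without error for them to take effect.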

import pandas as pd
import ast

# Define a converter function to parse tuples
def parse_tuple(value):
    return ast.literal_eval(value)

# Read the CSV file with the converter function
df = pd.read_csv('test.csv', converters={'feature_idx': parse_tuple})

# Print the DataFrame
print(df)

Alternatively, you can use the following code:

import pandas as pd
import ast

# Custom function to parse tuples
def parse_tuple(string):
    try:
        return ast.literal_eval(string)
    except (SyntaxError, ValueError):
        return string

df = pd.read_csv('test.csv')

df['feature_idx'] = df['feature_idx'].apply(parse_tuple)

print(df)
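To show what the fallback in parse_tuple does on its own, here is a minimal sketch that calls it on plain strings rather than on a DataFrame column. The input strings are made up for illustration.

```python
import ast


def parse_tuple(string):
    """Parse a Python literal; return the raw string if parsing fails."""
    try:
        return ast.literal_eval(string)
    except (SyntaxError, ValueError):
        return string


# A well-formed tuple literal becomes a real tuple.
good = parse_tuple("(1, 4, 15)")

# An unbalanced parenthesis raises SyntaxError internally, so the string comes back unchanged.
broken = parse_tuple("(1, 4")

# A function call is not a literal; literal_eval raises ValueError and nothing is executed.
unsafe = parse_tuple("__import__('os').getcwd()")

print(repr(good), repr(broken), repr(unsafe))
```

The silent fallback keeps malformed cells as strings, so a column processed this way may end up holding a mix of tuples and strings.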

ttp71kqs2#

I would use a regex separator in read_csv so that commas enclosed in parentheses are ignored.
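How such a pattern behaves can be checked on a single row with `re.split`. This is a minimal sketch: `SEP`, `line`, and `fields` are illustrative names, and it assumes that `)` occurs in a row only as part of a tuple.

```python
import re

# A comma counts as a separator only if it is NOT followed by a ")"
# that is reached before any "(", i.e. only if it sits outside parentheses.
SEP = re.compile(r",(?![^(]*[)])")

line = "(1, 4, 15),[0.82173419 0.85052599 0.86065046 0.93704226 0.84315839],0.862622256166522,80"

fields = SEP.split(line)
print(len(fields))
print(fields[0])
```

The commas inside the tuple are skipped because a closing parenthesis follows them, so the row splits into four fields instead of six.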

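from ast import literal_eval

import pandas as pd

# Split only on commas outside parentheses; a regex separator needs the Python engine
df = pd.read_csv("test.csv", sep=",(?![^(]*[)])", engine="python")

# Turn strings such as "(1, 4, 15)" into real tuples
df["feature_idx"] = df["feature_idx"].apply(literal_eval)

# Turn "[0.71 0.75 ...]" into a list of the individual score strings
df["cv_scores"] = df["cv_scores"].str.strip("[]").str.split()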

Output:

>>> print(df)

      feature_idx                      cv_scores  avg_score  total-features
0            (4,)  [0.71657, 0.75430665, 0.77...   0.773235              80
1         (4, 15)  [0.79150981, 0.82751849, 0...   0.841225              80
2      (1, 4, 15)  [0.82173419, 0.85052599, 0...   0.862622              80
3  (1, 4, 15, 70)  [0.82448556, 0.86513518, 0...   0.870524              80
Element types:
feature_idx     <class 'tuple'>
cv_scores        <class 'list'>
avg_score       <class 'float'>
total-features    <class 'int'>
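One detail worth noting: `.str.split()` leaves every score as a string inside the list. If numeric scores are needed, a conversion step can be added. The sketch below is one possible way to do that, reading two of the sample rows from an in-memory string (`SAMPLE` is an illustrative name) instead of a file.

```python
import io
from ast import literal_eval

import pandas as pd

# Two rows of the sample data, kept in memory so the sketch runs on its own.
SAMPLE = """\
feature_idx,cv_scores,avg_score,total-features
(4,),[0.71657    0.75430665 0.77866281 0.85293036 0.76370522],0.773235007449579,80
(4, 15),[0.79150981 0.82751849 0.83777517 0.9246948  0.82462535],0.8412247254527763,80
"""

# Same regex separator as above: split on commas outside parentheses.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r",(?![^(]*[)])", engine="python")

# "(4, 15)" -> (4, 15)
df["feature_idx"] = df["feature_idx"].apply(literal_eval)

# "[0.71 0.75 ...]" -> ["0.71", "0.75", ...] -> [0.71, 0.75, ...]
df["cv_scores"] = (
    df["cv_scores"]
    .str.strip("[]")
    .str.split()
    .apply(lambda scores: [float(s) for s in scores])
)

print(df["cv_scores"].iloc[0])
```

With floats in place, per-row statistics such as `df["cv_scores"].apply(max)` work without further parsing.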
