How to skip invalid (space-separated) lines in a csv file instead of skipping the first N rows

o7jaxewo  posted 2023-05-11  in  Other

I have a csv file that looks like this:

Provided by someone, unformatted by another-one for some purposes,
another-one would never hand out such a mess ;)

Have fun!

$ Column names
:azimuth:zenith:bjorkeny:energy:pos_x:pos_y:pos_z:proba_track:proba_cscd

$ Data
0:2,3495370211373316:1,1160038417256017:0,04899799823760986:3,3664000034332275:52,74:28,831:401,18600000000004:0,8243512974051896:0,17564870259481039
1:5,575785663044353:1,7428377336692398:0,28047099709510803:3,890000104904175:48,369:29,865:417,282:0,8183632734530938:0,18163672654690619
2:4,656124692722159:2,686909147834136:0,1198429986834526:3,2335000038146973:71,722:121,449:363,077:0,8283433133732535:0,17165668662674652

Reading this file in pandas requires skipping the first rows and defining '$' as the comment character:

import pandas as pd

df = pd.read_csv(
  'data/neutrinos.csv',
  on_bad_lines='skip',
  sep=':',
  skiprows=5,
  comment='$',
  index_col=0,
  decimal=','
)

Is there a more generic way to skip all the space-separated lines without having to specify the number of rows to skip or a comment character? Thanks in advance.

oipij1gg1#

The safest way is to parse the csv file first, skip every row that has only one value, and then use the result as input:

import csv
import io
import pandas as pd

# colon-separated dialect, no quoting
csv.register_dialect('mycsv', delimiter=':', quoting=csv.QUOTE_NONE)

# keep only rows that split into more than one field
newcsv = ''
with open('test.csv', newline='') as f:
    reader = csv.reader(f, 'mycsv')
    for row in reader:
        if len(row) > 1:
            newcsv += ':'.join(row) + '\n'

# feed the filtered text back to pandas
df = pd.read_csv(
  io.StringIO(newcsv),
  sep=':',
  index_col=0,
  decimal=','
)

Output:

azimuth    zenith  bjorkeny  energy   pos_x    pos_y    pos_z  proba_track  proba_cscd
0  2.349537  1.116004  0.048998  3.3664  52.740   28.831  401.186     0.824351    0.175649
1  5.575786  1.742838  0.280471  3.8900  48.369   29.865  417.282     0.818363    0.181637
2  4.656125  2.686909  0.119843  3.2335  71.722  121.449  363.077     0.828343    0.171657

Note that for a large file you may want to build the dataframe row by row (inside the csv reader loop) rather than creating one huge newcsv string.
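
A minimal sketch of that row-by-row idea, assuming the same test.csv and ':'-separated layout as above: collect the filtered rows in a list inside the reader loop and build the DataFrame once at the end instead of concatenating one big string (the 'idx' column name is just a placeholder, and the ','-decimals have to be converted by hand because the text no longer goes through read_csv):

import csv
import pandas as pd

rows = []
with open('test.csv', newline='') as f:
    reader = csv.reader(f, delimiter=':', quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) > 1:        # keep only rows that actually split on ':'
            rows.append(row)

# the first kept row is the header line ':azimuth:zenith:...'
header, *data = rows
df = pd.DataFrame(data, columns=['idx'] + header[1:]).set_index('idx')
# values are still strings with ',' as decimal separator -> convert manually
df = df.apply(lambda s: s.str.replace(',', '.', regex=False).astype(float))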

Don't do this at work

For your sample data, and outside of production, you can use the side effect of an on_bad_lines callback to fill a list with the *actual* rows:

data = []
# the lambda's side effect collects every "bad" line (those with too many ':'-separated fields)
_ = pd.read_csv('test.csv', on_bad_lines=lambda l: data.append(l), sep=':',
                index_col=0, decimal=',', engine='python')
# data[0] is the header row, the remaining entries are the data rows
df = pd.DataFrame(data[1:], columns=['idx'] + data[0][1:]).set_index('idx')

Output:

azimuth              zenith  ...         proba_track           proba_cscd
idx                                          ...
0    2.3495370211373316  1.1160038417256017  ...  0.8243512974051896  0.17564870259481039
1     5.575785663044353  1.7428377336692398  ...  0.8183632734530938  0.18163672654690619
2     4.656124692722159   2.686909147834136  ...  0.8283433133732535  0.17165668662674652

mitkmikd2#

CASE-I: Assuming the first '$' marks the column headers. I could not find a pure pandas solution, but here is one way to achieve this:

with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    lines = f.readlines()

# index of the first line that starts with '$'
skiprows = [i for i in range(len(lines)) if lines[i][0] == '$'][0]

This gives the value to pass to the 'skiprows' parameter of pandas.read_csv().
For a large CSV with millions of rows, where execution time matters, the same can be adapted to stop at the first match:

skiprows = 0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for i, line in enumerate(f):
        if line[0] == '$':
            skiprows = i
            break
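
In both cases the computed skiprows value can then be handed to read_csv; a sketch reusing the sep/comment/decimal settings from the question (the sample.csv path is the one used in this answer):

import pandas as pd

df = pd.read_csv(
    'C:/Users/r.goyal/Desktop/sample.csv',
    sep=':',
    skiprows=skiprows,   # computed above
    comment='$',
    index_col=0,
    decimal=','
)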

CASE-II (assuming nothing)

skiprows = 0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for i, line in enumerate(f):
        if len(line.split(':')) > len(line.split(' ')):
            skiprows = i - 1  # -1 to reach the line that mentions $ Column names
            break

A space-separated prose line yields more fields when split on ' ' than when split on ':', while a ':'-separated line does the opposite, so the comparison relies on exactly that.
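
For illustration, applying that comparison to two lines from the sample file:

prose = "Provided by someone, unformatted by another-one for some purposes,"
header = ":azimuth:zenith:bjorkeny:energy:pos_x:pos_y:pos_z:proba_track:proba_cscd"

print(len(prose.split(':')), len(prose.split(' ')))    # 1 9  -> space-separated, keep skipping
print(len(header.split(':')), len(header.split(' ')))  # 10 1 -> first ':'-separated line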
