How do I stop pandas (Python) from reading delimiters in rows I want it to skip?

Posted by qq24tv8q on 2023-02-02 in Python

I'm using pandas to read a ;-delimited log file table. The file starts with 16 lines of logger information, each prefixed with #.

# Logger type: CL2000
# HW rev: 7.2x
# FW rev: 5.79
# Logger ID: id0001
# Session No.: 94
# Split No.: 1
# Time: 20200222T230231
# Value separator: ";"
# Time format: 4
# Time separator: ""
# Time separator ms: ""
# Date separator: ""
# Time and date separator: "T"
# Bit-rate: 500000
# Silent mode: false
# Cyclic mode: false
Timestamp;Type;ID;Data
22T230231142;0;ad;1100000000000000
22T230231143;0;ac;0000f5ff04000000
22T230231143;0;ab;0000000000000000
22T230231143;0;aa;0000090000008000
22T230231143;0;a8;21005ac15cffd7ff
...

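The problem is that when I read this file with pandas, I tell it to skip the 16 header lines using the header argument, but the line # Value separator: ";" breaks the read_csv call because it sees the delimiter.
Calling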

sample = pd.read_csv(filename, header=11, delimiter=';')
print(sample)

gives me

Timestamp  Type  ID              Data
0      22T230231142     0  ad  1100000000000000
1      22T230231143     0  ac  0000f5ff04000000
2      22T230231143     0  ab  0000000000000000
...

This is the correct output I'm looking for, whereas what "should" be the correct function call

sample = pd.read_csv(filename, header=16, delimiter=';')
print(sample)

outputs

22T230231143  0  a8  21005ac15cffd7ff
0      22T230231144  0  a7  0e00000006000000
1      22T230231144  0  a6  aeffa9ff90ff0000
2      22T230231144  0  a5  59054a003d0083d5
...

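where the first row, used as the column names, is taken from the middle of the data. When I remove the ; character from the header and call read_csv(filename, header=16, delimiter=';'), I get the expected output, so it must be the semicolon. I couldn't find how to solve this in the documentation for read_csv or read_table, so if anyone knows, that would be very helpful.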
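
A possible explanation for why header=11 works while header=16 does not, offered as an assumption rather than something confirmed in this thread: with default quoting, the " that follows the ; on the Value separator line opens a quoted field, and that field swallows the following lines until the quote is closed on the Time and date separator line. Six physical lines then count as one row, so the 16 header lines become 11 logical rows. The sketch below shows this with Python's standard csv module, used here as a stand-in on the assumption that pandas applies comparable quoting rules; logical_rows is a hypothetical helper name.

```python
import csv
import io

# Header block copied from the question, plus the column row and one data row.
raw = '''# Logger type: CL2000
# HW rev: 7.2x
# FW rev: 5.79
# Logger ID: id0001
# Session No.: 94
# Split No.: 1
# Time: 20200222T230231
# Value separator: ";"
# Time format: 4
# Time separator: ""
# Time separator ms: ""
# Date separator: ""
# Time and date separator: "T"
# Bit-rate: 500000
# Silent mode: false
# Cyclic mode: false
Timestamp;Type;ID;Data
22T230231142;0;ad;1100000000000000
'''

column_row = ['Timestamp', 'Type', 'ID', 'Data']


def logical_rows(text, **fmt):
    """Hypothetical helper: parse text with csv.reader and return its rows."""
    return list(csv.reader(io.StringIO(text), delimiter=';', **fmt))


# Default quoting: the quote after ';' starts a multi-line quoted field.
default_rows = logical_rows(raw)
# QUOTE_NONE: quote characters are treated as ordinary text.
literal_rows = logical_rows(raw, quoting=csv.QUOTE_NONE)

print(default_rows.index(column_row))  # 11
print(literal_rows.index(column_row))  # 16
```

If this is indeed what happens in pandas, the row index to pass as header depends on how the quotes in the header block happen to pair up, which makes a hard-coded number fragile.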


uqzxnwby1#

You can set the comment argument to '#'; pandas will then automatically pick the first row without a # as your header:

from io import StringIO

import pandas as pd

data = '''...wrapped your data here...'''

# add the comment argument;
# the first line that does not start with '#' becomes the header
df = pd.read_csv(StringIO(data), comment='#', delimiter=';')
print(df)

    Timestamp     Type  ID  Data
0   22T230231142    0   ad  1100000000000000
1   22T230231143    0   ac  0000f5ff04000000
2   22T230231143    0   ab  0000000000000000
3   22T230231143    0   aa  0000090000008000
4   22T230231143    0   a8  21005ac15cffd7ff
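
For reference, here is a self-contained version of the same idea, run against a shortened copy of the sample from the question. The dtype mapping is an addition that is not part of the answer above: it assumes you want the ID and Data columns kept as text so that leading zeros in the hex payload are never lost to numeric inference.

```python
from io import StringIO

import pandas as pd

# Shortened copy of the sample file from the question.
data = '''# Logger type: CL2000
# Value separator: ";"
# Time separator: ""
# Time and date separator: "T"
# Cyclic mode: false
Timestamp;Type;ID;Data
22T230231142;0;ad;1100000000000000
22T230231143;0;ac;0000f5ff04000000
'''

df = pd.read_csv(
    StringIO(data),
    comment='#',                     # lines starting with '#' are ignored
    delimiter=';',
    dtype={'ID': str, 'Data': str},  # assumption: keep hex fields as text
)
print(df)
```

Note that comment='#' also cuts off any data line at the first # it contains, so this relies on # never appearing in the data rows.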

t2a7ltrp2#

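I think for this you'll need a small regex to read through the file and work out which row to skip to, because pandas will read the ';' in the value separator line.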

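import re

import pandas as pd

your_file = r"your_file.csv"  # path to the log file

with open(your_file, 'r') as fin:
    for number, row in enumerate(fin):
        # remember the index of the line that contains the delimiter
        if re.match(r'# Value separator: ";"', row):
            row_start = number
        # the first line that does not start with '#' is the column header
        if not re.match('^#', row):
            skip_val = (number - row_start) + 2 # to account for 0 index & header
            break

df = pd.read_csv(your_file, sep=';', skiprows=skip_val)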

print(df)

      Timestamp  Type  ID               Data
0  22T230231142     0  ad   1100000000000000
1  22T230231143     0  ac   0000f5ff04000000
2  22T230231143     0  ab   0000000000000000
3  22T230231143     0  aa   0000090000008000
4  22T230231143     0  a8   21005ac15cffd7ff
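
A variation on this approach, sketched here as an alternative rather than as this answer's method, avoids both the regex and the skiprows arithmetic: drop every line that starts with '#' before pandas sees the text. The helper name read_skipping_hash_lines is hypothetical, and the StringIO object stands in for an open file handle.

```python
import io

import pandas as pd


def read_skipping_hash_lines(handle, sep=';'):
    """Hypothetical helper: drop lines starting with '#' and parse the rest."""
    body = ''.join(line for line in handle if not line.startswith('#'))
    return pd.read_csv(io.StringIO(body), sep=sep)


# Stand-in for open("your_file.csv"); any iterable of text lines works.
sample = io.StringIO(
    '# Value separator: ";"\n'
    '# Time separator: ""\n'
    'Timestamp;Type;ID;Data\n'
    '22T230231142;0;ad;1100000000000000\n'
    '22T230231143;0;ac;0000f5ff04000000\n'
)

df = read_skipping_hash_lines(sample)
print(df)
```

With a real file this would be called as read_skipping_hash_lines(open(path)), ideally inside a with block. The whole body is held in memory once, which should be fine for typical log files but is worth keeping in mind for very large ones.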
