pandas: Inserting a header row into a CSV file in Python

dfty9e19  posted on 2023-01-04  in  Python

I am trying to insert a header row into a CSV file with 35M rows using the following snippet:

import csv

with open('E:\\Dataset\\dataset1.csv') as infile:
    text = infile.read()
header = ['User IP','Top-level domain', 'Timestamp', 'Is Attack', 'Request',
          'Len(request) withou TLD', 'Subdomains_count', 'w_count', 'w_max',
          'entropy', 'w_max_ratio', 'w_count_ratio', 'digits_ratio', 'uppercase_ratio',
          'time_avg', 'time_stdev', 'size_avg', 'size stdev', 'throughput', 'unique', 'entropy_avg'
          'entropy_stdev']

with open('E:\\Dataset\\dataset2.csv', 'w') as outfile:
    # join the headers into a string with commas and add a newline
    outfile.write(f"{','.join(header)}\n")
    outfile.write(text)

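However, when I print the data with the header, the header is shifted by one column. This is the head of the original data (without a header):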

186.169.253.58     surbl.org  1624438272607  False  \
0  186.169.253.58     surbl.org  1624438272607  False   
1  186.169.253.58  spamhaus.org  1624438273058  False   
2  186.169.253.58  spamhaus.org  1624438273058  False   
3  186.169.253.58  spamhaus.org  1624438273059  False   
4  186.169.253.58  spamhaus.org  1624438273059  False   

                       h.surbl.org   1  1.1  0  0.1      -0.0       0.0  \
0                      f.surbl.org   1    1  0    0 -0.000000  0.000000   
1  118.141.11.106.sbl.spamhaus.org  18    5  0    0  2.633731  0.000000   
2  118.141.11.106.zen.spamhaus.org  18    5  1    3  2.633731  0.166667   
3  128.141.11.106.sbl.spamhaus.org  18    5  0    0  2.863826  0.000000   
4  128.141.11.106.zen.spamhaus.org  18    5  1    3  2.863826  0.166667   

      0.0.1     0.0.2  0.0.3  3.4444444444444446  9.59311095410544   1.5  \
0  0.000000  0.000000    0.0            0.222222          0.440959   1.0   
1  0.000000  0.611111    0.0           55.555556        165.542375  17.2   
2  0.055556  0.611111    0.0            0.333333          0.500000  17.2   
3  0.000000  0.611111    0.0            0.333333          0.500000  17.3   
4  0.055556  0.611111    0.0            0.333333          0.500000  17.4   

   1.5811388300841898        468.75  0.4444444444444444  0.25849625007211563  \
0            0.000000   3333.333333            0.555556             0.000000   
1            0.421637    343.313373            0.000000             3.048277   
2            0.421637  43000.000000            0.000000             2.983547   
3            0.483046  43250.000000            0.000000             2.959741   
4            0.516398  43500.000000            0.000000             2.935936   

   0.81743691684035  
0          0.000000  
1          0.177285  
2          0.199622  
3          0.198131  
4          0.193400

This is the head of the data after adding the header:

User IP  Top-level domain  Timestamp  \
186.169.253.58     surbl.org     1624438272607      False   
186.169.253.58     surbl.org     1624438272607      False   
186.169.253.58  spamhaus.org     1624438273058      False   
186.169.253.58  spamhaus.org     1624438273058      False   
186.169.253.58  spamhaus.org     1624438273059      False   

                                      Is Attack  Request  \
186.169.253.58                      h.surbl.org        1   
186.169.253.58                      f.surbl.org        1   
186.169.253.58  118.141.11.106.sbl.spamhaus.org       18   
186.169.253.58  118.141.11.106.zen.spamhaus.org       18   
186.169.253.58  128.141.11.106.sbl.spamhaus.org       18   

                Len(request) withou TLD  Subdomains_count  w_count     w_max  \
186.169.253.58                        1                 0        0 -0.000000   
186.169.253.58                        1                 0        0 -0.000000   
186.169.253.58                        5                 0        0  2.633731   
186.169.253.58                        5                 1        3  2.633731   
186.169.253.58                        5                 0        0  2.863826   

                 entropy  w_max_ratio  w_count_ratio  digits_ratio  \
186.169.253.58  0.000000     0.000000       0.000000           0.0   
186.169.253.58  0.000000     0.000000       0.000000           0.0   
186.169.253.58  0.000000     0.000000       0.611111           0.0   
186.169.253.58  0.166667     0.055556       0.611111           0.0   
186.169.253.58  0.000000     0.000000       0.611111           0.0   

                uppercase_ratio    time_avg  time_stdev  size_avg  \
186.169.253.58         3.444444    9.593111         1.5  1.581139   
186.169.253.58         0.222222    0.440959         1.0  0.000000   
186.169.253.58        55.555556  165.542375        17.2  0.421637   
186.169.253.58         0.333333    0.500000        17.2  0.421637   
186.169.253.58         0.333333    0.500000        17.3  0.483046   

                  size stdev  throughput    unique  entropy_avgentropy_stdev  
186.169.253.58    468.750000    0.444444  0.258496                  0.817437  
186.169.253.58   3333.333333    0.555556  0.000000                  0.000000  
186.169.253.58    343.313373    0.000000  3.048277                  0.177285  
186.169.253.58  43000.000000    0.000000  2.983547                  0.199622  
186.169.253.58  43250.000000    0.000000  2.959741                  0.198131

It looks like the first column is being used as the index.


qnyhuwrf  #1

If you are able to use it, use pandas: the header problem you are dealing with can be solved with the explicit import and export flags it provides.

import pandas

header = ['User IP', 'Top-level domain', 'Timestamp', 'Is Attack', 'Request',
          'Len(request) withou TLD', 'Subdomains_count', 'w_count', 'w_max',
          'entropy', 'w_max_ratio', 'w_count_ratio', 'digits_ratio', 'uppercase_ratio',
          'time_avg', 'time_stdev', 'size_avg', 'size stdev', 'throughput', 'unique', 'entropy_avg',
          'entropy_stdev']
# names= supplies the column labels; index_col=False keeps the first column as data
df = pandas.read_csv('E:\\Dataset\\dataset1.csv', names=header, index_col=False)
# to_csv is a DataFrame method; header=True writes the labels, index=False omits the row index
df.to_csv('E:\\Dataset\\dataset2.csv', header=True, index=False)
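A likely cause of the shift, judging from the fused label `entropy_avgentropy_stdev` in the question's output: the question's header list has no comma between `'entropy_avg'` and `'entropy_stdev'`, whereas the list in this answer does. Python joins adjacent string literals, so the question's list ends up with 21 names for 22 data columns, and pandas uses the first column as the index when the header row has one field fewer than the data rows. The sketch below reproduces only the concatenation part, using shortened lists (`question_tail` and `answer_tail` are illustrative names, not from the original code).

```python
# Adjacent string literals with no comma between them are joined by Python.
question_tail = ['unique', 'entropy_avg'
                 'entropy_stdev']   # comma missing, as in the question
answer_tail = ['unique', 'entropy_avg',
               'entropy_stdev']     # comma present, as in the answer

print(question_tail)                          # ['unique', 'entropy_avgentropy_stdev']
print(len(question_tail), len(answer_tail))   # 2 3
```

Comparing `len(header)` against the number of fields in the first data row before writing the file would catch this kind of mismatch early.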

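It isn't clear from your question whether you want an index in the output, or whether one exists in the input. If one exists in the input, set index_col=0; if you want one in the output, set index=True in the to_csv call.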
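To make the effect of these flags concrete, here is a minimal sketch on a tiny in-memory CSV. The three-column sample data and the names `raw`, `names`, `shifted`, and `written` are assumptions made up for illustration; they are not the asker's dataset. It first reproduces the shifted header (a header row one field short), then reads the same data with `names=` and `index_col=False` and writes it back with `index=False`.

```python
import io

import pandas as pd

# Illustrative sample data: three columns, no header row.
raw = "1.2.3.4,surbl.org,False\n5.6.7.8,spamhaus.org,True\n"
names = ["User IP", "Top-level domain", "Is Attack"]

# Header row with only two labels for three data columns:
# pandas treats the first column as the index, so the labels shift.
shifted = pd.read_csv(io.StringIO("User IP,Top-level domain\n" + raw))
print(shifted.index.tolist())    # the IP addresses ended up in the index

# Supplying all labels explicitly keeps every column as data.
df = pd.read_csv(io.StringIO(raw), names=names, index_col=False)

# index=False leaves the row index out of the written file.
buffer = io.StringIO()
df.to_csv(buffer, header=True, index=False)
written = buffer.getvalue()
print(written)
```

For the real 35M-row file the same two calls apply with file paths instead of `io.StringIO`; whether the whole file fits in memory depends on the machine.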
