检测CSV文件中的多个标头

sd2nnvve  于 2023-06-03  发布在  其他
关注(0)|答案(1)|浏览(121)

我有一个很大的CSV文件。在csv文件中有几个头文件。我已经分开了三个标题使用代码如下。

import pandas as pd

df = pd.read_csv("three measurement.csv", header=None)

# find header rows
df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]
# create groups for each section
groupings = df.iloc[:, 0].str.contains("|".join(df_titles)).cumsum()

# split into new dataframes as dictionary
d = {}
for i, j in df.groupby(groupings):
    # define name of dictionary key as title, and set data of DF as values
    d[j.iloc[0, 0]] = pd.DataFrame(data=j.values[4:, :],
                                   # create MultiIndex from 3 header rows
                                   columns=pd.MultiIndex.from_arrays(
                                       j.iloc[0:4, :].ffill(axis=1).values))
    # suggested not to use, but you can set the variables directly (outside of the dictionary)
    globals()[j.iloc[0, 0]] = pd.DataFrame(data=j.values[4:, :],
                                           columns=pd.MultiIndex.from_arrays(
                                               j.iloc[0:4, :].ffill(axis=1).values))

Level and DistortionTHD RatioReference Waveform是我分离的测量类型。但是,测量类型可以改变(增加或减少)如何检测测量类型,而无需像上述代码(df_titles=["..")中那样指定测量类型。这意味着在上面的代码中,要找到标题行,必须根据下面显示的csv指定度量类型的名称。如果度量类型不同并且是递增的,我如何找到标题行而不必定义df_title变量。此处CSV文件

"Level and Distortion",,,,,,,,,,,,,,,
"Ch1 (F)",,"Ch1 (H2)",,"Ch1 (H3)",,"Ch1 (Total)",,"Ch2 (F)",,"Ch2 (H2)",,"Ch2 (H3)",,"Ch2 (Total)",
X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y
Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms
20,0.00772013164376534,20,5.60982648239952E-05,20,0.000389709733151927,20,0.011492581958802,20,0.00699792689186063,20,0.000151471712877565,20,0.000389940899485093,20,0.010080448380793
21.1179638886716,0.00747175133180212,21.1179638886716,8.83327496082501E-05,21.1179638886716,0.000426696028852445,21.1179638886716,0.0122462876404656,21.1179638886716,0.00756340531214287,21.1179638886716,0.000181697169530165,21.1179638886716,0.000443499862648762,21.1179638886716,0.0108494276048029
"THD Ratio",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
Hz,%,Hz,%,,,,,,,,,,,,
20,83.009797319554,20,82.1460991930652,,,,,,,,,,,,
21.1179638886716,85.3656629417084,21.1179638886716,82.0338466400102,,,,,,,,,,,,
22.2984199401618,90.6674826441566,22.2984199401618,85.7190774666039,,,,,,,,,,,,
"Reference Waveform",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
s,V,s,V,,,,,,,,,,,,
0,0,0,0,,,,,,,,,,,,
2.08333333333333E-05,6.47890208369956E-08,2.08333333333333E-05,6.47890208369956E-08,,,,,,,,,,,,
4.16666666666667E-05,5.18304721721536E-07,4.16666666666667E-05,5.18304721721536E-07,,,,,,,,,,,,
6.25E-05,1.74923655865586E-06,6.25E-05,1.74923655865586E-06,,,,,,,,,,,,

实际上在csv文件中有很多行数据。
由于测量类型可以改变(增加或减少),因此有11种测量类型。我希望python能够检测度量类型,而不必像df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]那样定义它。
谢谢

eblbsuwk

eblbsuwk1#

您可以使用pd.read_csv和正确的参数:

import pandas as pd
import numpy as np

df = pd.read_csv('three measurement.csv', dtype='str', header=None, keep_default_na=False, na_values='')

dfs = {}
idx = df.index[df.notna().sum(axis=1).eq(1)].tolist()
for i, j in zip(idx, idx[1:]+[len(df)]):
    name = df.loc[i, 0]
    df1 = df.loc[i+1:j-1].dropna(how='all', axis=1).replace('N/A', np.nan)
    m = pd.to_numeric(df1[0], errors='coerce').notna()
    mi = pd.MultiIndex.from_arrays(df1.loc[~m].ffill(axis=1).values)
    dfs[name] = pd.DataFrame(df1.loc[m].values.astype(float), columns=mi)

输出:

>>> list(dfs.keys())
['Level and Distortion',
 'THD Ratio',
 'Reference Waveform',
 'Summary: Signal Path1.Acoustic Response.Level and Distortion',
 'Summary: Signal Path1.Acoustic Response.THD Ratio',
 'Summary: Signal Path1.Acoustic Response.Reference Waveform']

>>> dfs['Level and Distortion']
          Ch1 (F)             Ch1 (H2)             Ch1 (H3)           Ch1 (Total)                 Ch2 (F)             Ch2 (H2)             Ch2 (H3)           Ch2 (Total)          
                X         Y          X         Y          X         Y           X         Y             X         Y          X         Y          X         Y           X         Y
               Hz      Vrms         Hz      Vrms         Hz      Vrms          Hz      Vrms            Hz      Vrms         Hz      Vrms         Hz      Vrms          Hz      Vrms
0       20.000000  0.007720  20.000000  0.000056  20.000000  0.000390   20.000000  0.011493     20.000000  0.006998  20.000000  0.000151  20.000000  0.000390   20.000000  0.010080
1       21.117964  0.007472  21.117964  0.000088  21.117964  0.000427   21.117964  0.012246     21.117964  0.007563  21.117964  0.000182  21.117964  0.000443   21.117964  0.010849
2       22.298420  0.005521  22.298420  0.000107  22.298420  0.000408   22.298420  0.011866     22.298420  0.007572  22.298420  0.000209  22.298420  0.000505   22.298420  0.012603
3       23.544861  0.007055  23.544861  0.000269  23.544861  0.000313   23.544861  0.012183     23.544861  0.007603  23.544861  0.000227  23.544861  0.000568   23.544861  0.012779
4       24.860977  0.008749  24.860977  0.000260  24.860977  0.000653   24.860977  0.013460     24.860977  0.005882  24.860977  0.000217  24.860977  0.000364   24.860977  0.013626
..            ...       ...        ...       ...        ...       ...         ...       ...           ...       ...        ...       ...        ...       ...         ...       ...
123  16089.472533  0.060342        NaN       NaN        NaN       NaN         NaN       NaN  16089.472533  0.061988        NaN       NaN        NaN       NaN         NaN       NaN
124  16988.844997  0.059465        NaN       NaN        NaN       NaN         NaN       NaN  16988.844997  0.047697        NaN       NaN        NaN       NaN         NaN       NaN
125  17938.490757  0.036990        NaN       NaN        NaN       NaN         NaN       NaN  17938.490757  0.035544        NaN       NaN        NaN       NaN         NaN       NaN
126  18941.220002  0.061363        NaN       NaN        NaN       NaN         NaN       NaN  18941.220002  0.050418        NaN       NaN        NaN       NaN         NaN       NaN
127  20000.000000  0.064279        NaN       NaN        NaN       NaN         NaN       NaN  20000.000000  0.059796        NaN       NaN        NaN       NaN         NaN       NaN

[128 rows x 16 columns]

>>> dfs['THD Ratio']
              Ch1                      Ch2           
                X          Y             X          Y
               Hz          %            Hz          %
0       20.000000  83.009797     20.000000  82.146099
1       21.117964  85.365663     21.117964  82.033847
2       22.298420  90.667483     22.298420  85.719077
3       23.544861  86.536381     23.544861  85.941987
4       24.860977  83.845657     24.860977  91.810750
..            ...        ...           ...        ...
112   8845.029526  58.356075   8845.029526  37.976407
113   9339.450707  32.391086   9339.450707  58.039594
114   9861.509138  34.756184   9861.509138  38.396633
115  10412.749693  31.359953  10412.749693  20.900020
116  10994.803600  35.219231  10994.803600  26.008255

[117 rows x 4 columns]

>>> dfs['Reference Waveform']
              Ch1                      Ch2              
                X             Y          X             Y
                s             V          s             V
0        0.000000  0.000000e+00   0.000000  0.000000e+00
1        0.000021  6.478902e-08   0.000021  6.478902e-08
2        0.000042  5.183047e-07   0.000042  5.183047e-07
3        0.000063  1.749237e-06   0.000063  1.749237e-06
4        0.000083  4.146200e-06   0.000083  4.146200e-06
...           ...           ...        ...           ...
719994  14.999875  3.035885e-08  14.999875  3.035885e-08
719995  14.999896 -7.711807e-09  14.999896 -7.711807e-09
719996  14.999917  2.917104e-10  14.999917  2.917104e-10
719997  14.999937  3.821995e-10  14.999937  3.821995e-10
719998  14.999958 -4.497611e-11  14.999958 -4.497611e-11

[719999 rows x 4 columns]

>>> dfs['Summary: Signal Path1.Acoustic Response.Level and Distortion']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []

>>> dfs['Summary: Signal Path1.Acoustic Response.THD Ratio']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []

>>> dfs['Summary: Signal Path1.Acoustic Response.Reference Waveform']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []

连接所有 Dataframe :

>>> pd.concat(dfs, names=['Measurement'], axis=1)

Measurement Level and Distortion                                 ... Summary: Signal Path1.Acoustic Response.THD Ratio Summary: Signal Path1.Acoustic Response.Reference Waveform                                      
                         Ch1 (F)             Ch1 (H2)            ...                                Passed Upper Limit                                                    Channel Passed Lower Limit Passed Upper Limit
                               X         Y          X         Y  ...                                               Ch1                                                        Ch1                Ch1                Ch1
                              Hz      Vrms         Hz      Vrms  ...                                               Ch2                                                        Ch2                Ch2                Ch2
0                      20.000000  0.007720  20.000000  0.000056  ...                                               NaN                                                NaN                        NaN                NaN
1                      21.117964  0.007472  21.117964  0.000088  ...                                               NaN                                                NaN                        NaN                NaN
2                      22.298420  0.005521  22.298420  0.000107  ...                                               NaN                                                NaN                        NaN                NaN
3                      23.544861  0.007055  23.544861  0.000269  ...                                               NaN                                                NaN                        NaN                NaN
4                      24.860977  0.008749  24.860977  0.000260  ...                                               NaN                                                NaN                        NaN                NaN
...                          ...       ...        ...       ...  ...                                               ...                                                ...                        ...                ...
719994                       NaN       NaN        NaN       NaN  ...                                               NaN                                                NaN                        NaN                NaN
719995                       NaN       NaN        NaN       NaN  ...                                               NaN                                                NaN                        NaN                NaN
719996                       NaN       NaN        NaN       NaN  ...                                               NaN                                                NaN                        NaN                NaN
719997                       NaN       NaN        NaN       NaN  ...                                               NaN                                                NaN                        NaN                NaN
719998                       NaN       NaN        NaN       NaN  ...                                               NaN                                                NaN                        NaN                NaN

[719999 rows x 33 columns]

相关问题