pandas: character-specific conditional check in a string

Posted by nr9pn0ug on 2023-01-24 in Other

I have to read and analyze some log files with Python. The files usually contain strings in the following required format:

date Don Dez 10 21:41:41.747 2020
base hex  timestamps absolute
no internal events logged
// version 13.0.0
//28974.328957 previous log file: 21-41-41_Voltage.asc
// Measurement UUID: 9e0029d6-43a0-49e3-8708-3ec70363124c
28976.463987    LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:20.6,65.48,11.99,0.009843,12,0.01078,11.99,0.01114,11.99,0.01096,12,0.009984,4.595,0,1.035,0,0.1745,0,2,OM_2_1,0"
28978.600018    LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:22.7,65.47,11.99,0.009896,12,0.01079,11.99,0.01117,11.99,0.01097,12,0.009965,4.628,0,1.044,0,0.1698,0,2,OM_2_1,0"

Sometimes, however, the generated files have an unwanted format, such as the following:

date Die Jul 13 08:40:22.878 2021                                                                                                                                                                   
base hex  timestamps absolute                                                                                                                                                                   
no internal events logged                                                                                                                                                                   
// version 13.0.0                                                                                                                                                                   
//1035.595166 previous log file: 08-40-22_Voltage.asc                                                                                                                                                                   
// Measurement UUID: 2baf3f3f-300a-4f0a-bcbf-0ba5679d8be2                                                                                                                                                                   
"1203.997816    LoggingString := ""Log" 9:01 am  Tuesday     July 13    2021    09:01:58.3  24.53   13.38   0.8948  13.37   0.8801  13.37   0.89    13.37   0.9099  13.47   0.8851  4.551   0.00115 0.8165  0   0.2207  0   5   OM_3_2  1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   "0"""
"1206.086064    LoggingString := ""Log" 9:02 am  Tuesday     July 13    2021    09:02:00.4  24.53   13.37   0.8945  13.37   0.8801  13.37   0.8902  13.37   0.9086  13.46   0.8849  5.142   0.001185    1.033   0   0.1897  0   5   OM_3_2  1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   "0"""

date    Mit Jun 16  10:11:43.493    2021                                                                                                                                                                    
base    hex timestamps  absolute                                                                                                                                                                            
no  internal    events  logged                                                                                                                                                                          
//  version 13.0.0                                                                                                                                                                              
//  Measurement UUID:   fe4a6a97-d907-4662-89f9-bd246aa54a33                                                                                                                                                                            
10025.661597    LoggingString   :=  """""""Log"""   12:59   PM  Wednesday   June    16  2021    12:59:01.1  66.14   0.00423 0   0.001206    0   0.001339    0   0.001229    0   0.001122    0   0.05017 0   0.01325 0   0.0643  0   0   OM_2_1_transition   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   """0"""""""
10030.592652    LoggingString   :=  """""""Log"""   12:59   PM  Wednesday   June    16  2021    12:59:06.1  66.14   11.88   0.1447  11.88   0.1444  11.88   0.1442  11.87   0.005552    11.9    0.00404 2.55    0   0.4712  0   0.09924 0   0   OM_2_1_transition   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   """0"""""""

Since I only care about the data below the "// Measurement UUID" line, I use the following code to extract the data from strings in the required format:

import os
from natsort import natsorted

# `directory` and `processed_files` are defined earlier in the script
files = natsorted(os.listdir(directory))
for file in files:
    base, ext = os.path.splitext(file)
    if file not in processed_files and ext == '.asc':
        print("File added:", file)
        file_path = os.path.join(directory, file)
        count = 0  # index of the current row in Output_list
        with open(file_path, 'r') as file_in:
            processed_files.append(file)
            Output_list = []  # each parsed LoggingString line is stored here
            Final = []  # the required fields from each line are stored here
            for line in map(str.strip, file_in):
                if "LoggingString" in line:
                    # index of the first double quote, which opens the payload
                    first_quote = line.index('"')
                    # index of the next double quote; in a well-formed line this closes the payload
                    last_quote = line.index('"', first_quote + 1)
                    Output_list.append(
                        line[:first_quote].split(maxsplit=1)
                        + line[first_quote + 1: last_quote].split(","),
                    )
                    Final.append(Output_list[count][7:27])
                    count += 1  # move on to the next parsed row
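
To make the slicing concrete, here is a small sketch that applies the same split to the first well-formed sample line from the question. The helper name `split_logging_line` is hypothetical and only wraps the three steps used in the loop; it is not part of the original script.

```python
def split_logging_line(line: str) -> list:
    """Split one well-formed LoggingString line the same way the loop does."""
    first_quote = line.index('"')                  # opening quote of the payload
    last_quote = line.index('"', first_quote + 1)  # next quote, i.e. the closing one
    prefix = line[:first_quote].split(maxsplit=1)  # timestamp and the 'LoggingString := ' label
    payload = line[first_quote + 1:last_quote].split(",")
    return prefix + payload


# First well-formed sample line from the question, split over several literals for readability.
sample = ('28976.463987    LoggingString := "Log,5:45 AM, Friday, December 11, 2020,'
          '05:45:20.6,65.48,11.99,0.009843,12,0.01078,11.99,0.01114,11.99,0.01096,'
          '12,0.009984,4.595,0,1.035,0,0.1745,0,2,OM_2_1,0"')

fields = split_logging_line(sample)
selected = fields[7:27]

print(len(fields))   # 28: two prefix items plus 26 comma-separated payload fields
print(selected[0])   # 05:45:20.6
print(selected[-1])  # OM_2_1
```

The slice `[7:27]` therefore starts at the time-of-day field and ends at the mode name, which only holds when the payload really is comma-separated.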

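In the unwanted format, the fields of each string are separated by one or more spaces, as shown above. My guess is that the log file generator sometimes produces a file that is not comma-separated, or a comma-separated file with errors, but I am not sure.
I tried adding the following condition: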

if "LoggingString" in line:
    if ',' in line:
        first_quote = line.index('"')
        last_quote = line.index('"', first_quote + 1)
        Output_list.append(line[:first_quote].split(maxsplit=1)
                           + line[first_quote + 1: last_quote].split(","),)
        Final.append(Output_list[count][7:27])
        count += 1  # move on to the next parsed row
    else:
        raise Exception("Error in File")

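However, this does not do the job: if a string in any of the unwanted formats contains even a single ',', the program treats it as valid and processes it, which leads to wrong results.
How can I make sure that, after files with the required format have been processed, an error message is printed when a file in any other format is processed? What kind of conditional check could be implemented here?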

dbf7pr2w #1

You can use pandas.read_csv with a *regex* separator:

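import glob
import os
import pandas as pd

frames = []
for f in glob.glob("/tmp/Log*.txt"):
    # Split on commas, or on whitespace that sits between word characters or quotes.
    # skiprows=6 assumes six header lines; the third sample in the question has only five.
    df = (pd.read_csv(f, sep=r',|(?<=[\w"])\s+(?=[\w"])',
                      header=None, skiprows=6, engine="python").iloc[:, 2:28])
    # Store only the file name, independent of the path separator.
    df.insert(0, "filename", os.path.basename(f))
    frames.append(df)

out = pd.concat(frames)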
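
If malformed files should be rejected rather than parsed, one option is to validate every LoggingString line against the expected layout before processing it. The following is only a sketch under stated assumptions: the pattern is inferred from the well-formed sample in the question, the expected field count of 26 is taken from that same sample, and the names `WELL_FORMED`, `EXPECTED_FIELDS`, `is_well_formed` and `check_file` are hypothetical.

```python
import re

# Assumed layout of a well-formed line, inferred from the sample in the question:
#   <timestamp><whitespace>LoggingString := "Log,<field>,<field>,...,<field>"
# Fields may contain spaces but no quotes, commas or tabs.
WELL_FORMED = re.compile(r'\d+\.\d+\s+LoggingString := "Log(?:,[^",\t]*)+"')

# Assumption: 26 comma-separated payload fields ("Log" included), as in the sample.
EXPECTED_FIELDS = 26

# Well-formed sample line from the question.
GOOD_LINE = ('28976.463987    LoggingString := "Log,5:45 AM, Friday, December 11, 2020,'
             '05:45:20.6,65.48,11.99,0.009843,12,0.01078,11.99,0.01114,11.99,0.01096,'
             '12,0.009984,4.595,0,1.035,0,0.1745,0,2,OM_2_1,0"')

# Beginning of one malformed sample line from the question (shortened here).
BAD_LINE = ('"1203.997816    LoggingString := ""Log" 9:01 am  Tuesday     July 13'
            '    2021    09:01:58.3  24.53')


def is_well_formed(line: str) -> bool:
    """Return True only if the whole line matches the assumed comma-separated layout."""
    line = line.strip()
    if WELL_FORMED.fullmatch(line) is None:
        return False
    payload = line[line.index('"') + 1:-1]  # text between the two quotes
    return len(payload.split(",")) == EXPECTED_FIELDS


def check_file(path: str) -> None:
    """Raise ValueError at the first malformed LoggingString line of a file."""
    with open(path, "r") as handle:
        for number, line in enumerate(handle, start=1):
            if "LoggingString" in line and not is_well_formed(line):
                raise ValueError(f"{path}: unexpected format on line {number}")


print(is_well_formed(GOOD_LINE))  # True
print(is_well_formed(BAD_LINE))   # False
```

Because the whole line has to match, a stray comma inside a space-separated line is not enough to pass the check, which is the weakness of the `',' in line` test. The field count may differ between logger versions, so `EXPECTED_FIELDS` would need adjusting to the real files.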
