regex: Using regular expressions in Python to parse a log file and extract values

laawzig2, posted on 2023-04-22 in Python

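--Edited to remove the cols parameter, which was left over from an earlier attempt at parsing these files and which I had failed to remove from my minimal working example.
I am trying to use Python to parse a log file containing text like the example below, so that I can see how things like filament current vary between log files.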

Format               : GPF: internal memory mode
Shape Type           : TRAPEZIUM
Number of Blocks     : 45
Main field placement : FLOATING
Sub field placement  : LOWERLEFT
Fracture style       : SUBFIELD
Number of Main bits  : 20
Number of Sub bits   : 14
Work Level           : 1
Number freq. factors : 42
Minimum frequency    : 0.671637
Maximum frequency    : 1.009999
Reference frequency    : 0x80000000
Pixel time           : 18852175201.000000 (1.88522e+10)
High Tension         : 100 kV
Block size           :       500.00000000 um,      500.00000000 um
Pattern size         :     60405.86000000 um,    60459.00000000 um
Resolution           :         0.00100000 um
Beamstepsize         :         0.01000000 um
Main resolution      :         0.00100000 um
Trap resolution      :         0.00050000 um

Archive: 10na_300um.beam_100             Date: 26-MAR-2023 
Version Beam      4.00

High Tension           :      100 kV     Final Aperture      :      300 um

Gun control :
bias voltage           :  -400.00  V     filament current    :     2.31  A
Extractor   voltage    :  6200.00  V     current             :   190.78 uA
tilt  X         25857  :  -28.820 mA     tilt  Y      40904  :   31.591 mA
shift X         37124  :   16.916 mA     shift Y      29171  :   -7.572 mA
AD setpoint dac 100kV  :     4095

Lens control:
Final lens      47660  :  4.14551  A
C1 setting      23871  : 4774.200  V     C2 setting    7020  :  1.01707  A

Beam quality:
Fine Focus dac         :     2337
Diagonal Stigmator dac :     1813        Axis Stigmator dac  :     1868

measured spotsize      :    0.024 um     beamcurrent         :    10.27 nA
PMHV setting (dac)     :     2602
BEAMPAR setting        :    300.0 s

Beam diameter       :           26 nm ,           28 nm  current : 10.106 nA

resolution          :     1.000000 nm ,     1.000000 nm
beam step size      :    10.000000 nm ,    10.000000 nm
main resolution     :     1.000000 nm ,     1.000000 nm  [1,1]
trap resolution     :     0.500000 nm ,     0.500000 nm  [20,20]
max. main field size:  1048.576000 um ,  1048.576000 um
max. trap field size:     4.525000 um ,     4.525000 um
tdd shift           :            3666 ,            3666

main compensation               x:  500.000 um    y:  500.000 um

trap compensation               x:    4.525 um    y:    4.525 um

the deflection compensation settings:
                               main     rel  trapezium  rel  pull-in
                             (12bits)         (12bits)      (12bits)
x gain     [bit]       :       2039       0    3331       0    578
y gain     [bit]       :       2068       0    3247       0    738
x rotation [bit]       :       1992       0    2120       0   2061
y rotation [bit]       :       2242       0    1605       0   1588
x keystone [bit]       :       1302       0
y keystone [bit]       :       2045       0

table translation settings      x:    0.000 um    y:    0.000 um
main beam deflection settings   x:    0.000 um    y:    0.000 um

the deflection compensation sensitivities:
(with respect to centre of the deflection field)
                              main       trapezium     pull-in
                             (12bits)     (12bits)    (12bits)
x gain     [ppm/bit]   :      2.442       12.963      19.004
y gain     [ppm/bit]   :      2.439       12.932      18.331
x rotation [ppm/bit]   :      4.908        5.828      38.192
y rotation [ppm/bit]   :      4.907        5.744      38.297
x keystone [ppm/mm/bit]:      1.487
y keystone [ppm/mm/bit]:      1.480
Calculated frequencies:
  Base frequency   :    1.2633 MHz
  Maximum frequency:    1.2759 MHz
  Minimum frequency:  848.4536 kHz

the last adjust main comp height @ relative position ***,*** um : 
22.600 um
Spot defocussed with FL 0 bits
Spot defocussed with FFOFFSET 0 bits

pg information dcd
Last restored file: /home/pg/archive/perftest.dcd_100
Date: 20:46 30-MAR-2023

Main:               Trap:               Stigmator:      SEM:
mddx2    2131,1746  tddxx    2098,1945  fspascor  2047  dcvmdmzoom    1
mddy2    2139,2293  tddxy    2074,2065  fspdscor  2047  dcvmdmpan     0,   0
mddxy2   1717,1686  tddxxy   2059,1982  fspffcor  2047  mdpsem     2367,1983
mddyx2   2427,1574  tddxx2   2101,2188  fspasfunc    0  dcvmdmstep   32
mddx3    1821,1776  tddxy2   2157,2127  fspdsfunc    0  dcvmdmrate    4
mddy3    2344,1757  tddyx    2052,2216  fspfffunc    0
mddorth  2040,2238  tddyy    1949,1989  dcvdc       31
mdpgain  1619,2427  tddyxy   2056,2023
mdpvout  3279,1041  tddyx2   2017,2065
mdpslow     0,   0  tddyy2   2060,2100
Beam Blanker:                           Pattern Generator:
bbsmode             PN                  beamondelay          0.00_ns
bbshvadjust         80.00_%,80.00_%     beamoffdelay        10.00_ns
bbszeroadjust        0.00_%, 0.00_%     negativecommondelay 25.00_ns

mbsbase             50000_ns            sbsbase               500_ns
mbsfactor             100_ns/um         sbsfactor             500_ns/um
mbsfocus             1000_ns/um

I wrote the following code using re to try to extract all of the colon-separated data.

import re

fname = 'example.log'

colon_data = r'\s*(\S*\s*\S*\s*\S*)\s*:\s*(\S*\s*\S*)\s*|\n'

pat = re.compile(colon_data)

with open(fname, encoding='utf8') as f:
    contents = f.readlines()

data_dict = {}
for line in contents:
    # match() is anchored at the start of the line and returns at most one match
    m = pat.match(line)
    if m:
        data_dict[m.group(1)] = m.group(2)

data_dict


This gives me the output below, which is close to what I want, but it gets tripped up on lines such as `bias voltage           :  -400.00  V     filament current    :     2.31  A`, where it only recognises the first value on the line and not the second. It also interprets everything as strings, and it would be nice to recognise floats as floats. Can anyone suggest how I might fix this?

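{'Format               : GPF': 'internal memory',
 'Shape Type           ': 'TRAPEZIUM\n',
 'Number of Blocks': '45\n',
 'Main field placement': 'FLOATING\n',
 'Sub field placement': 'LOWERLEFT\n',
 'Fracture style       ': 'SUBFIELD\n',
 'Work Level           ': '1\n',
 'Number freq. factors': '42\n',
 'Minimum frequency    ': '0.671637\n',
 'Maximum frequency    ': '1.009999\n',
 'Reference frequency    ': '0x8000000\n',
 'Pixel time           ': '18852175201.000000 (1.88522e+10)',
 'High Tension         ': '100 kV',
 'Block size           ': '500.00000000 um,',
 'Pattern size         ': '60405.86000000 um,',
 'Resolution           ': '0.00100000 um',
 'Beamstepsize         ': '0.01000000 um',
 'Main resolution      ': '0.00100000 um',
 'Trap resolution      ': '0.00050000 um',
 None: None,
 'Archive: 10na_300um.beam_100             Date': '26-MAR-2023 \n',
 'High Tension           ': '100 kV',
 'Gun control ': '',
 'bias voltage           ': '-400.00  V',
 'Extractor   voltage    ': '6200.00  V',
 'tilt  X         25857': '-28.820 mA',
 'shift X         37124': '16.916 mA',
 'Lens control': '',
 'Final lens      47660': '4.14551  A',
 'C1 setting      23871': '4774.200  V',
 'Beam quality': '',
 'Fine Focus dac': '2337\n',
 'Diagonal Stigmator dac': '1813        Axis',
 'measured spotsize      ': '0.024 um',
 'PMHV setting (dac)': '2602\n',
 'BEAMPAR setting        ': '300.0 s'}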
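
The reason only the first pair on a line is found is that `re.match` is anchored at the start of the string and returns at most one match, whereas `re.findall` scans the whole line and returns every non-overlapping match. Here is a minimal sketch on the line quoted above. The pattern `pair` is a simplified one assumed purely for illustration, not the pattern used in the question's code:

```python
import re

line = "bias voltage           :  -400.00  V     filament current    :     2.31  A"

# Simplified, hypothetical pattern: a key, a colon, a signed number,
# and an optional unit that is consumed but not captured.
pair = re.compile(r"([A-Za-z][\w ]*?)\s*:\s*(-?\d+(?:\.\d+)?)(?:\s+[A-Za-z]+)?")

first = pair.match(line)    # anchored at position 0, at most one match
every = pair.findall(line)  # scans the whole line, one tuple per pair

print(first.groups())
print(every)
```

With `match`, only the `bias voltage` pair is available; `findall` also returns the `filament current` pair from the same line.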

jexiocij #1

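Modify the regex pattern, convert the numbers to floats (or ints), and use re.findall instead of re.match so that every key/value pair on a line is picked up, not just the first one.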

import re

fname = 'example.log'

# key, colon, a run of digits/'.'/'-', an optional unit,
# then two or more spaces or the end of the line
colon_data = r'\s*([\w\s().]+)\s*:\s*([-\d.]+)(?:\s*[a-zA-Z]+)?(?:\s{2,}|$)'
pat = re.compile(colon_data)

with open(fname, encoding='utf8') as f:
    contents = f.readlines()

data_dict = {}
for line in contents:
    # findall returns every (key, value) pair on the line, not just the first
    for key, value in pat.findall(line):
        # values containing '.' or '-' become floats, everything else ints
        data_dict[key.strip()] = float(value) if '.' in value or '-' in value else int(value)

data_dict
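
If the units matter, or if a value that is not a clean number should be skipped rather than raise `ValueError` during conversion, the pattern can be tightened. The following is only a sketch under stated assumptions: the names `PAIR`, `to_number`, `parse_line` and `parse_log` are hypothetical, and the stricter number pattern `-?\d+(?:\.\d+)?` is an assumption about what counts as a value in these logs.

```python
import re

# Hypothetical refinement: same key and separator rules as the pattern above,
# a stricter number, and a named group for the optional unit.
PAIR = re.compile(
    r"\s*(?P<key>[\w\s().]+?)\s*:\s*"
    r"(?P<value>-?\d+(?:\.\d+)?)"
    r"(?:\s*(?P<unit>[a-zA-Z]+))?"
    r"(?:\s{2,}|$)"
)

def to_number(text):
    """Return an int when the text has no decimal point, else a float."""
    return float(text) if "." in text else int(text)

def parse_line(line):
    """Return {key: (number, unit or None)} for every pair found on one line."""
    return {
        m.group("key").strip(): (to_number(m.group("value")), m.group("unit"))
        for m in PAIR.finditer(line)
    }

def parse_log(text):
    """Merge the pairs from every line; later duplicates overwrite earlier ones."""
    result = {}
    for line in text.splitlines():
        result.update(parse_line(line))
    return result

# Two lines taken from the example log in the question.
sample = (
    "bias voltage           :  -400.00  V     filament current    :     2.31  A\n"
    "Fine Focus dac         :     2337\n"
)
parsed = parse_log(sample)
print(parsed["filament current"])
```

To compare a quantity such as filament current between log files, `parse_log` could be called on the contents of each file and the same key looked up in each result.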
