Regex: searching for text inside multi-line curly braces

2lpgd968  posted on 2023-11-20  in Other

I have a large file containing text in a pattern like the following.

{ device_id: 'do0142', message: '[0,"xyz","something",{}]}

{
  device_id: 'Ampn05',
  message: '[0,"23","something",{"connect":1,"error":"something","info":"xyz","valid":"Unavailable","timestamp":"2020-03-15T04:33:32Z","vendorId":"cycle","country":"anywhere"}]}

{ device_id: 'do0142', message: '[0,"xyz","something",{}]}

{
  device_id: 'do0142',
  message: '[0,"23","something",{"connect":1,"error":"something","info":"xyz","valid":"Unavailable","timestamp":"2020-03-15T04:33:32Z","vendorId":"cycle","country":"anywhere"}]}

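I want to search for a device_id inside the curly braces and, if a match is found, return the entire contents of that pair of braces.
For example, if I search for device_id = 'do0142', the output should look like this: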

{ device_id: 'do0142', message: '[0,"xyz","something",{}]}

{ device_id: 'do0142', message: '[0,"xyz","something",{}]}

{
  device_id: 'do0142',
  message: '[0,"23","something",{"connect":1,"error":"something","info":"xyz","valid":"Unavailable","timestamp":"2020-03-15T04:33:32Z","vendorId":"cycle","country":"anywhere"}]}


I tried using a regular expression in Python, but I only get partial output:

import re

file_name = "log.txt"
word = "do0142"
# This pattern only captures the bracketed message payload, not the enclosing braces
regex = r"(\[.*\])"

with open(file_name, 'r', encoding="utf8") as infile:
    text = infile.read()

matches = re.finditer(regex, text, re.MULTILINE)

# enumerate() yields (index, match) pairs, so unpack both
for match_num, match in enumerate(matches, start=1):
    for group_num in range(1, len(match.groups()) + 1):
        print(match.group(group_num))


Please help me write the Python code.

8gsdolmq  #1

A *naive* approach is to turn on dot-matches-all and capture the relevant blocks:

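import re

with open("log.txt", encoding="utf8") as f:
    key = re.escape("do0142")
    # Raw string so that \s reaches the regex engine unchanged
    pat = r"({\s*device_id:\s*'%s'.+?\s*}$)" % key
    # re.S lets . cross newlines; re.M lets $ match at the end of each line
    matches = re.findall(pat, f.read(), flags=re.M|re.S)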

Demo: [regex101]
Output:

# print(*matches, sep="\n"*2)
    
{ device_id: 'do0142', message: '[0,"xyz","something",{}]}

{ device_id: 'do0142', message: '[0,"xyz","something",{}]}

{
  device_id: 'do0142',
  message: '[0,"23","something",{"connect":1,"error":"something","info":...
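To make the role of the two flags easier to see, here is a minimal, self-contained sketch of the same pattern applied to an in-memory string instead of a file. The function name `find_blocks` and the shortened sample text are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Shortened sample in the same shape as the question's log (illustrative only).
SAMPLE = """{ device_id: 'do0142', message: '[0,"xyz","something",{}]}

{
  device_id: 'Ampn05',
  message: '[0,"23","something",{"connect":1}]}

{
  device_id: 'do0142',
  message: '[0,"23","something",{"connect":1}]}
"""


def find_blocks(text, device_id):
    """Return every brace-delimited block whose device_id equals device_id.

    Hypothetical helper: it assumes each block's closing brace is the last
    character on its line, which is what the trailing `$` relies on.
    """
    key = re.escape(device_id)
    # re.S: `.` may cross newlines, so a block can span several lines.
    # re.M: `$` matches at the end of every line, not only at end of text.
    pattern = r"(\{\s*device_id:\s*'%s'.+?\s*\}$)" % key
    return re.findall(pattern, text, flags=re.M | re.S)


for block in find_blocks(SAMPLE, "do0142"):
    print(block, end="\n\n")
```

Because `.+?` is lazy, each match stops at the first closing brace that ends a line. If a closing brace were followed by trailing whitespace, the match would probably run on into the next block, which is one reason the approach is called naive.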

bhmjp9jg  #2

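Structured data cannot be parsed reliably with regular expressions. Instead, since the input file apparently consists of multiple YAML documents separated by one or more blank lines, you can use a YAML parser such as pyyaml to parse each group of non-blank lines into a dict. You can then test whether it has the device_id value you are looking for, and if it does, the document goes to the output: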

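import yaml
from itertools import groupby

with open('log.txt') as file:
    # Group consecutive lines by whether they differ from a bare newline
    for has_content, lines in groupby(file, '\n'.__ne__):
        if has_content:
            # Parse each run of non-blank lines as one YAML document
            data = yaml.safe_load(block := ''.join(lines))
            if data['device_id'] == 'do0142':
                print(block)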

So, given the following as the input:

{ device_id: 'do0142', message: '[0,"xyz","something",{}]'}

{
  device_id: 'Ampn05',
  message: '[0,"23","something",{"connect":1,"error":"something","info":"xyz","valid":"Unavailable","timestamp":"2020-03-15T04:33:32Z","vendorId":"cycle","country":"anywhere"}]'}

{ device_id: 'do0142', message: '[0,"xyz","something",{}]'}

{
  device_id: 'do0142',
  message: '[0,"23","something",{"connect":1,"error":"something","info":"xyz","valid":"Unavailable","timestamp":"2020-03-15T04:33:32Z","vendorId":"cycle","country":"anywhere"}]'}


the code will output:

{ device_id: 'do0142', message: '[0,"xyz","something",{}]'}

{ device_id: 'do0142', message: '[0,"xyz","something",{}]'}

{
  device_id: 'do0142',
  message: '[0,"23","something",{"connect":1,"error":"something","info":"xyz","valid":"Unavailable","timestamp":"2020-03-15T04:33:32Z","vendorId":"cycle","country":"anywhere"}]'}


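Note that I have fixed the unterminated quoted strings in every message in the sample input. I assume they were formatting mistakes introduced while you were trimming the input down for the question.
Demo: https://replit.com/@blhsing1/ImpracticalGreatInsurance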
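If the real log does contain the unterminated quotes noted above, a YAML parser would likely reject those blocks. A possible middle ground, sketched here using only the standard library, is to keep the blank-line grouping but test each block with a small regex instead of parsing it. The names `iter_blocks` and `blocks_for_device` and the shortened sample are hypothetical, and the sketch assumes that the text `device_id: '...'` never appears inside a message payload.

```python
import io
import re
from itertools import groupby

# Shortened sample in the question's original form, with unterminated quotes.
SAMPLE_TEXT = """{ device_id: 'do0142', message: '[0,"xyz","something",{}]}

{
  device_id: 'Ampn05',
  message: '[0,"23","something",{"connect":1}]}

{
  device_id: 'do0142',
  message: '[0,"23","something",{"connect":1}]}
"""


def iter_blocks(lines):
    """Yield each run of consecutive non-blank lines as a single string."""
    for has_content, group in groupby(lines, key=lambda line: bool(line.strip())):
        if has_content:
            yield "".join(group)


def blocks_for_device(lines, device_id):
    """Yield the blocks whose device_id field equals device_id exactly."""
    field = re.compile(r"device_id:\s*'%s'" % re.escape(device_id))
    for block in iter_blocks(lines):
        if field.search(block):
            yield block


# Any iterable of lines works, including an open file object.
for block in blocks_for_device(io.StringIO(SAMPLE_TEXT), "do0142"):
    print(block)
```

Since the lines are consumed lazily, only one block is held in memory at a time, which may matter for a large file. The trade-off is that nothing is validated: the block is treated as plain text, not as structured data.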
