Python regex: extracting text into a dictionary

omjgkv6w  posted on 2023-06-25  in  Python

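I'm new to coding. Can anyone help me convert the text below into a dictionary, using regex or any other technique?
The text "Bus Number: Departure" is common to all messages/blocks.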

KPN_Sleeper: Bus Number: Departure 
Bus code: Kpn-866489 KA-01-7233 Bangalore 
AC Sleeper/56 Seats
24 Seats booked 

SRS: Bus Number: Departure 
Bus code: SRS-5858 KA-31-5985 Bangalore 

SAM: Bus Number: Departure 
Bus code: SAM-0077 TN-23-0777 Chennai

Expected output:

{0:{
  "Bus_name": "KPN_Sleeper",
  "Bus code":"Kpn-866489",
  "Bus Number": "KA-01-7233",
  "Departure": "Bangalore",
  "others": "AC Sleeper/56 Seats 24 Seats booked "
},
1:{
  "Bus_name": "SRS",
  "Bus code":"SRS-5858",
  "Bus Number": "KA-31-5985",
  "Departure": "Bangalore",
  "others": ""
}}

Since I'm new to both coding and regex, I'm finding this hard to build.

fslejnso (answer #1)

Taking your comments into account, I think you can try this:

^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?

Sample code:

import re

regex = r"^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?"

test_str = ("KPN_Sleeper: Bus Number: Departure \n"
    "Bus code: Kpn-866489 KA-01-7233 Bangalore dfdf\n"
    "AC Sleeper/56 Seats\n"
    "24 Seats booked \n\n"
    "SRS: Bus Number: Departure \n"
    "Bus code: SRS-5858 KA-31-5985 Bangalore dfdf dfd\n\n\n"
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
    "asdfadf ;kasdjlfads;f lkadsjf")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match in matches:
    print("Bus Name: "+match.group(1)+" Bus Code: "+match.group(2)+" Bus No: "+match.group(3)+" Departure: "+match.group(4))

# The "others" value is available in match.group(5); it is None when a block has no extra lines.
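The sample code prints the fields, while the question asks for a nested dictionary. A minimal sketch of that last step is shown here, reusing the same pattern. The helper name parse_buses, the whitespace stripping, and the choice to join the extra lines with single spaces are assumptions made for illustration, not part of the original answer. Stripping is needed because the greedy ([^\n]+) group can also pick up trailing spaces.

```python
import re

# Same pattern as in the answer, compiled once with MULTILINE so that
# ^ and $ match at line boundaries.
PATTERN = re.compile(
    r"^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?",
    re.MULTILINE,
)


def parse_buses(text):
    """Return {index: {...}} for every bus block in text (hypothetical helper)."""
    result = {}
    for index, match in enumerate(PATTERN.finditer(text)):
        name, code, number, departure, others = match.groups()
        result[index] = {
            "Bus_name": name.strip(),
            "Bus code": code,
            "Bus Number": number,
            # The greedy ([^\n]+) group may include trailing spaces.
            "Departure": departure.strip(),
            # Group 5 is None when a block has no extra lines.
            "others": " ".join(
                line.strip() for line in (others or "").splitlines()
            ),
        }
    return result


sample = (
    "KPN_Sleeper: Bus Number: Departure \n"
    "Bus code: Kpn-866489 KA-01-7233 Bangalore \n"
    "AC Sleeper/56 Seats\n"
    "24 Seats booked \n"
    "\n"
    "SRS: Bus Number: Departure \n"
    "Bus code: SRS-5858 KA-31-5985 Bangalore \n"
    "\n"
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai"
)

buses = parse_buses(sample)
print(buses)
```

Note that blocks must be separated by a blank line for this to work: without one, the optional fifth group would swallow the next block's header line as part of "others".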

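Explanation:

  1. ^(.*):\s* --> (.*) is the first capture group and takes the bus name; \s* covers any whitespace after the colon
  2. Bus Number: Departure\s*\n --> the literal text "Bus Number: Departure", followed by optional whitespace and a newline
  3. Bus code:\s* --> the next line starts with "Bus code", then a colon and optional whitespace
  4. ([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*
    a) ([^ ]+) --> bus code; \s --> whitespace
    b) ([^ ]+) --> bus number; \s --> whitespace
    c) ([^\n]+) --> Departure, which may be more than one word
    d) [ \t]* --> allows trailing spaces or tabs after Departure
  5. (?:\n|$) --> matches a newline or the end of the string
  6. ((?:[^\n]+(?:\n|$))+)?
    a) [^\n]+(?:\n|$) --> matches a run of non-newline characters followed by a newline or the end of the string
    b) ?: makes it a non-capturing group
    c) + means there can be several such lines
    d) the outer () collects all the "others" lines into one group
    e) ? makes the whole "others" part optional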
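The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace outside character classes and allows # comments. This is only a sketch of the same regex: the group names (name, code, number, departure, others) and the constant BUS_BLOCK are chosen here for illustration, and the literal spaces in the fixed text are written as [ ] because verbose mode would otherwise drop them.

```python
import re

BUS_BLOCK = re.compile(
    r"""
    ^(?P<name>.*):\s*                    # 1. bus name, then a colon
    Bus[ ]Number:[ ]Departure\s*\n       # 2. the fixed header text
    Bus[ ]code:\s*                       # 3. start of the second line
    (?P<code>[^ ]+)\s                    # 4a. bus code
    (?P<number>[^ ]+)\s                  # 4b. bus number
    (?P<departure>[^\n]+)[ \t]*          # 4c-d. departure and trailing blanks
    (?:\n|$)                             # 5. newline or end of string
    (?P<others>(?:[^\n]+(?:\n|$))+)?     # 6. optional extra lines
    """,
    re.MULTILINE | re.VERBOSE,
)

text = "SAM: Bus Number: Departure \nBus code: SAM-0077 TN-23-0777 Chennai\n"
match = BUS_BLOCK.search(text)
fields = match.groupdict()  # "others" is None when no extra lines follow
print(fields)
```

Named groups make the later dictionary-building step less error-prone than positional group numbers, since groupdict() already returns the fields keyed by name.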
